Skill Gap Analysis

Comparing Team Skills Against Market Demands

Authors

Tuba Anwar, Boston University
Kriti Singh, Boston University
Soham Deshkhaire, Boston University

Published

December 6, 2025

Modified

December 7, 2025

1 Executive Summary

This report presents a comprehensive analysis of the job market prospects for data analytics and related roles in 2024. Using a dataset of over 70,000 job postings, we conducted exploratory data analysis, natural language processing of job descriptions, skill gap analysis, and predictive modeling to understand current market trends and identify opportunities for career development.

Key Findings:

  • SAP Applications, Oracle Cloud, and Microsoft Office are the most in-demand software skills
  • Data Analysis capabilities appear in over 25,000 job postings
  • Significant skill gaps exist in enterprise software (SAP, Oracle) and cloud platforms (AWS, Azure)
  • Machine learning models achieved XX% accuracy in predicting job categories
  • Salary predictions show strong correlation with experience level and technical skill proficiency


3 Introduction

3.1 Project Overview

This research explores the evolving landscape of Business Analytics, Data Science and Machine Learning (ML) careers in 2024. As organizations increasingly integrate AI in every function from operations to strategic decision-making, the demand for professionals with strong data and analytics skills is growing rapidly. This study explores four interconnected questions: the most sought-after skills for analytics and ML roles, how job descriptions are changing to require AI literacy, which industries are leading data-driven hiring, and what the short- to medium-term career outlook looks like for analytics professionals. By combining insights from labor market data and recent academic research, this project provides a holistic view of how AI and automation are transforming workforce expectations and employability strategies for students and professionals.


4 Research Rationale

AI, ML and data analytics have changed the global job market, creating both opportunities and challenges for new professionals. Organizations are looking for people with strong technical skills such as Python, SQL and ML modeling along with soft skills like communication, ethical reasoning and problem framing. Understanding these changing expectations is important for job seekers so they can match their academic learning and career plans with what the market is demanding.

This research aims to map out the connection between AI adoption, skill demand and career pathways. Examining these trends can help predict how automation, cloud computing, and generative AI will affect job prospects. Additionally, identifying leading sectors such as finance, healthcare and consulting in analytics hiring will clarify future opportunities and guide both individual career planning and institutional curriculum development.


5 Research Questions

Our analysis addresses the following core questions:

  1. What are the most in-demand technical and soft skills for data science, business analytics, and ML roles in 2024?
  2. How have job descriptions evolved to incorporate AI/ML requirements and competencies?
  3. Which industries are hiring the most data scientists and analytics professionals, and why?
  4. What is the realistic career outlook for business analytics professionals in the short to medium term?

6 Literature Review

Recent research highlights how analytics and ML skills are becoming increasingly central across industries. Job postings now demand expertise across the full data lifecycle, from engineering and modeling to interpretability, governance, and ethical considerations (J. Liu, Chen, and Lyu 2024; Y. Liu and Li 2024). AI and ML proficiency is now a core requirement for business analytics, signaling a blending of traditional analytics and data science roles. Industries such as finance, healthcare, and technology are leading this shift, relying on predictive and prescriptive analytics to drive strategy and innovation (Smaldone et al. 2022).

The literature also points to an optimistic but dynamic career outlook. Georgieff and Hyee (2022) found that AI adoption does not necessarily displace analytics jobs but rather increases demand for professionals who combine digital and domain expertise. Gerhart, Rastegari, and Cole (2024) emphasize that communicating and applying insights is as vital as technical proficiency. Together, these studies suggest that the analytics professionals best positioned for long-term success are those with adaptable, interdisciplinary, and AI-literate skill sets in a rapidly evolving job market.

Evidence from the U.S. Bureau of Labor Statistics suggests strong growth in data science careers, with increasing demand for professionals who understand both technical analytics and business strategy (U.S. Bureau of Labor Statistics 2024). Graham (2025) highlights emerging AI skills in technology sectors, demonstrating how new competencies rapidly become market requirements. Y. Liu and Li (2024) provide insights into talent cultivation in big data management, showing that employers seek candidates with 1-3 years of experience and strong foundational skills in data mining and analysis.

The evidence suggests that the analytics field is expanding rather than contracting. Organizations are not replacing human analysts with AI systems; instead, they are expanding their teams and seeking professionals who can work alongside AI tools. This presents a compelling opportunity for students entering the field, provided they develop both technical depth and the soft skills necessary to translate data insights into business value.


7 Project Scope & Methodology

7.1 Data Source

We will utilize the Lightcast 2024 dataset, which includes:

  • Job posting volumes for analytics, data science, and ML roles by industry and location
  • Salary data across data-related positions and geographies
  • Skill requirements extracted from job descriptions
  • Hiring trends across time periods
  • Company size and industry classifications
  • Emerging role titles and job requirements

7.2 Analysis Approach

Our team will:

  1. Clean and preprocess the Lightcast data using Python (pandas, NumPy)
  2. Extract and categorize skills mentioned in job descriptions for analytics and ML roles
  3. Compare trends across industries, geographies, and job levels
  4. Analyze salary patterns to understand compensation for different skill combinations
  5. Identify emerging roles and how job requirements are evolving
  6. Visualize findings through interactive dashboards using Plotly and Matplotlib
  7. Develop career strategy recommendations based on market insights
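Step 2 above, extracting skills from free-text job descriptions, can be sketched as a simple keyword scan. The `SKILLS` vocabulary and the sample postings below are illustrative stand-ins of our own; the actual analysis would draw on Lightcast's structured skill fields rather than this hand-picked list.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical skill vocabulary for illustration only; the real analysis
# would use the structured skill columns in the Lightcast data.
SKILLS = ["python", "sql", "machine learning", "aws", "tableau"]

def count_skill_mentions(descriptions, skills=SKILLS):
    """Count how many postings mention each skill at least once."""
    counts = Counter({skill: 0 for skill in skills})
    for text in descriptions:
        lowered = text.lower()
        for skill in skills:
            # Word-boundary match avoids counting e.g. "sqlite" as "sql"
            if re.search(r"\b" + re.escape(skill) + r"\b", lowered):
                counts[skill] += 1
    return counts

# Toy postings standing in for real job-description text
postings = pd.Series([
    "Seeking an analyst skilled in Python and SQL for reporting.",
    "ML engineer: build machine learning pipelines on AWS with Python.",
    "BI developer with Tableau dashboards and SQL experience.",
])

skill_counts = count_skill_mentions(postings)
print(skill_counts.most_common())
```

Each posting contributes at most one count per skill, so the tallies can be read directly as "number of postings mentioning X", matching how skill demand is reported later in this study.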

7.3 Expected Findings

We anticipate discovering that:

  • Python, SQL, and cloud platforms (AWS, GCP, Azure) remain foundational skills
  • Generative AI and prompt engineering are emerging as newly valued competencies
  • Finance, healthcare, and technology sectors lead in analytics hiring
  • Soft skills like communication and domain expertise are increasingly emphasized
  • Analytics roles offer strong job security and career growth potential
  • Salary premiums exist for ML/AI-specialized roles compared to traditional business analytics

7.4 Deliverables (Future Phases)

  • Exploratory Data Analysis (EDA) with visualizations
  • Interactive dashboards showing skill trends, industry hiring patterns, and salary insights
  • Career pathway recommendations for different specializations
  • Personal career action plans for each team member

8 References

9 Data Cleaning and Preparation

9.1 Introduction

This document details the comprehensive data cleaning and preprocessing steps applied to the 2024 job market dataset.

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Load raw data
df = pd.read_csv('lightcast_job_postings.csv')

print(f"Initial dataset shape: {df.shape}")
print(f"Total rows: {df.shape[0]:,}")
print(f"Total columns: {df.shape[1]}")
Initial dataset shape: (72498, 131)
Total rows: 72,498
Total columns: 131

9.2 Step 1: Removing Redundant Columns

We remove redundant columns to improve dataset quality and analysis efficiency.

# List of columns to drop
columns_to_drop = [
    "ID", "URL", "ACTIVE_URLS", "DUPLICATES", "LAST_UPDATED_TIMESTAMP",
    "BODY", "TITLE_RAW", "COMPANY_RAW", "ACTIVE_SOURCES_INFO",
    "NAICS2", "NAICS2_NAME", "NAICS3", "NAICS3_NAME", 
    "NAICS4", "NAICS4_NAME", "NAICS5", "NAICS5_NAME", 
    "NAICS6", "NAICS6_NAME",
    "SOC_2", "SOC_2_NAME", "SOC_3", "SOC_3_NAME", 
    "SOC_4", "SOC_4_NAME", "SOC_5", "SOC_5_NAME",
    "SOC_2021_2", "SOC_2021_2_NAME", "SOC_2021_3", "SOC_2021_3_NAME", 
    "SOC_2021_5", "SOC_2021_5_NAME",
    "LOT_CAREER_AREA", "LOT_CAREER_AREA_NAME", "LOT_OCCUPATION", "LOT_OCCUPATION_NAME",
    "LOT_SPECIALIZED_OCCUPATION", "LOT_SPECIALIZED_OCCUPATION_NAME", 
    "LOT_OCCUPATION_GROUP", "LOT_OCCUPATION_GROUP_NAME",
    "LOT_V6_SPECIALIZED_OCCUPATION", "LOT_V6_SPECIALIZED_OCCUPATION_NAME", 
    "LOT_V6_OCCUPATION", "LOT_V6_OCCUPATION_NAME",
    "LOT_V6_OCCUPATION_GROUP", "LOT_V6_OCCUPATION_GROUP_NAME", 
    "LOT_V6_CAREER_AREA", "LOT_V6_CAREER_AREA_NAME",
    "ONET_2019", "ONET_2019_NAME", "CIP6", "CIP6_NAME", "CIP2", "CIP2_NAME",
    "COUNTY", "COUNTY_NAME", "COUNTY_OUTGOING", "COUNTY_NAME_OUTGOING", 
    "COUNTY_INCOMING", "COUNTY_NAME_INCOMING",
    "MSA", "MSA_OUTGOING", "MSA_INCOMING",
    "SALARY_TO", "SALARY_FROM", "ORIGINAL_PAY_PERIOD",
    "MODELED_EXPIRED", "MODELED_DURATION"
]

# Drop only columns that exist
columns_to_drop = [col for col in columns_to_drop if col in df.columns]
df_cleaned = df.drop(columns=columns_to_drop, inplace=False)

print(f"Columns removed: {len(columns_to_drop)}")
print(f"New dataset shape: {df_cleaned.shape}")
Columns removed: 69
New dataset shape: (72498, 62)

9.3 Step 2: Handling Missing Values

# Missing value statistics
missing_stats = pd.DataFrame({
    'Column': df_cleaned.columns,
    'Missing_Count': df_cleaned.isnull().sum().values,
    'Missing_Percentage': (df_cleaned.isnull().sum().values / len(df_cleaned) * 100).round(2)
})
missing_stats = missing_stats[missing_stats['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

print(f"Columns with missing values: {len(missing_stats)}")
print("\nTop 10 columns with highest missing percentages:")
print(missing_stats.head(10))
Columns with missing values: 62

Top 10 columns with highest missing percentages:
                    Column  Missing_Count  Missing_Percentage
18    MAX_YEARS_EXPERIENCE          64068               88.37
13           MAX_EDULEVELS          56183               77.50
14      MAX_EDULEVELS_NAME          56183               77.50
50       LIGHTCAST_SECTORS          54711               75.47
51  LIGHTCAST_SECTORS_NAME          54711               75.47
20                  SALARY          41690               57.51
3                 DURATION          27316               37.68
17    MIN_YEARS_EXPERIENCE          23146               31.93
2                  EXPIRED           7844               10.82
30       MSA_NAME_INCOMING           3962                5.46

# Drop columns with >50% missing values
threshold = len(df_cleaned) * 0.5
cols_before = len(df_cleaned.columns)
df_cleaned = df_cleaned.dropna(thresh=threshold, axis=1)
cols_after = len(df_cleaned.columns)

print(f"Columns dropped due to >50% missing values: {cols_before - cols_after}")

# Fill numerical columns with median
numerical_cols = df_cleaned.select_dtypes(include=[np.number]).columns.tolist()
for col in numerical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        median_val = df_cleaned[col].median()
        df_cleaned[col] = df_cleaned[col].fillna(median_val)

# Fill categorical columns with "Unknown"
categorical_cols = df_cleaned.select_dtypes(include=['object']).columns.tolist()
for col in categorical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        df_cleaned[col] = df_cleaned[col].fillna("Unknown")

print(f"Total missing values after imputation: {df_cleaned.isnull().sum().sum()}")
Columns dropped due to >50% missing values: 6
Total missing values after imputation: 0

9.4 Step 3: Removing Duplicates

# Check for duplicates
initial_rows = len(df_cleaned)
duplicates_count = df_cleaned.duplicated(subset=['TITLE_NAME', 'COMPANY_NAME', 'LOCATION']).sum()

print(f"Initial rows: {initial_rows:,}")
print(f"Duplicate rows detected: {duplicates_count:,}")

# Remove duplicates
df_cleaned = df_cleaned.drop_duplicates(
    subset=['TITLE_NAME', 'COMPANY_NAME', 'LOCATION'], 
    keep='first'
)

final_rows = len(df_cleaned)
print(f"Final rows after duplicate removal: {final_rows:,}")
print(f"Rows removed: {initial_rows - final_rows:,}")
Initial rows: 72,498
Duplicate rows detected: 13,278
Final rows after duplicate removal: 59,220
Rows removed: 13,278

9.5 Step 4: Final Summary

# Create summary statistics
summary_stats = pd.DataFrame({
    'Metric': [
        'Total Rows',
        'Total Columns',
        'Numerical Columns',
        'Categorical Columns',
        'Missing Values'
    ],
    'Value': [
        f"{len(df_cleaned):,}",
        f"{len(df_cleaned.columns)}",
        f"{len(df_cleaned.select_dtypes(include=[np.number]).columns)}",
        f"{len(df_cleaned.select_dtypes(include=['object']).columns)}",
        f"{df_cleaned.isnull().sum().sum()}"
    ]
})

print(summary_stats.to_string(index=False))

# Save cleaned dataset
df_cleaned.to_csv('cleanedjob_postings.csv', index=False)
print("\nCleaned dataset saved as 'cleanedjob_postings.csv'")
             Metric  Value
         Total Rows 59,220
      Total Columns     56
  Numerical Columns     12
Categorical Columns     44
     Missing Values      0

Cleaned dataset saved as 'cleanedjob_postings.csv'

9.6 Summary

The data cleaning process has successfully prepared the job market dataset:

  • Removed redundant columns
  • Handled missing values strategically
  • Removed duplicate postings
  • Final clean dataset ready for analysis

10 Exploratory Data Analysis



10.1 Introduction

This document presents a comprehensive exploratory data analysis of the 2024 job market. We examine hiring trends, employment types, geographic patterns, and remote work opportunities to provide actionable insights for job seekers and market analysts.

import pandas as pd
import numpy as np
import hvplot.pandas
import panel as pn
import warnings
warnings.filterwarnings('ignore')

# Enable panel for rendering
pn.extension()

# Updated color palette - blue, pink, purple, red
COLORS = ['#2196F3', '#E91E63', '#9B59B6', '#F44336', '#00BCD4', '#FF4081', '#7E57C2', '#EF5350']

# Load cleaned data
df = pd.read_csv('cleanedjob_postings.csv')

#print(f"Dataset shape: {df.shape}")
#print(f"Analysis period: 2024 Job Market")
#print(f"Total job postings analyzed: {len(df):,}")

10.2 Job Title Analysis

10.2.1 Top In-Demand Job Titles

Understanding which job titles dominate the market helps identify high-demand roles and emerging career opportunities.

# Get top 10 job titles 
title_counts = df[df['TITLE_NAME'].notna()]['TITLE_NAME'].value_counts().head(10)
title_df = pd.DataFrame({
    'Job Title': title_counts.index, 
    'Count': title_counts.values,
    'Percentage': (title_counts.values / len(df[df['TITLE_NAME'].notna()]) * 100).round(2)
})

print(f"\nTop job titles : {len(title_counts)} titles")

# Create interactive horizontal bar chart
title_df.hvplot.barh(
    x='Job Title',
    y='Count',
    title='Top 10 Most In-Demand Job Titles (2024)',
    height=500,
    width=800,
    color='#2196F3',
    hover_cols=['Percentage'],
    ylabel='',
    xlabel='Number of Job Postings',
    flip_yaxis=True
).opts(xformatter='%.0f')

Top job titles : 10 titles

The top 10 most in-demand job titles are dominated by Data Analyst roles, which appear far more frequently than any other title. Business Intelligence Analysts, Enterprise Architects, and Data Modelers also rank highly, highlighting strong employer demand for data-focused and technical architecture skills. Overall, the top roles show a clear market emphasis on analytics, data management, and solution-oriented positions in 2024.

10.3 Employment Type Distribution

10.3.1 Full-Time vs Part-Time vs Contract

Understanding employment type distribution helps job seekers target positions matching their career preferences.

# Count jobs by employment type 
df_employment = df[df['EMPLOYMENT_TYPE_NAME'].notna()].copy()

# Remove "Unknown" values
df_employment = df_employment[df_employment['EMPLOYMENT_TYPE_NAME'] != 'Unknown']

employment_counts = df_employment['EMPLOYMENT_TYPE_NAME'].value_counts()
employment_df = pd.DataFrame({
    'Employment Type': employment_counts.index, 
    'Count': employment_counts.values,
    'Percentage': (employment_counts.values / employment_counts.sum() * 100).round(1)
})

#print(f"\nEmployment types : {len(df_employment):,} jobs")
#print(employment_df.to_string(index=False))

# Create bar chart
employment_df.hvplot.bar(
    x='Employment Type',
    y='Count',
    title='Job Market Distribution by Employment Type (2024)',
    height=400,
    width=700,
    color='Employment Type',
    cmap=COLORS,
    hover_cols=['Percentage'],
    ylabel='Number of Job Postings',
    rot=45
).opts(yformatter='%.0f')

Full-time positions dominate the 2024 market, making up over 95% of all postings; only a small share of roles are part-time (3.3%) or mixed part-time/full-time (1.4%). This shows that employers strongly prefer hiring for stable, full-time positions.

10.4 Remote Work Analysis

10.4.1 Remote vs Hybrid vs On-Site

The prevalence of remote work reflects post-pandemic hiring trends and employer flexibility.

# Count jobs by remote type and remove not specified values
df_remote = df.copy()

# Replace None, NaN, Unknown, empty strings with NaN
df_remote['REMOTE_TYPE_NAME'] = df_remote['REMOTE_TYPE_NAME'].replace({
    '[None]': np.nan,
    'None': np.nan,
    'Unknown': np.nan,
    '': np.nan,
    'Not Specified': np.nan
})

# Drop rows with NaN in REMOTE_TYPE_NAME
df_remote = df_remote[df_remote['REMOTE_TYPE_NAME'].notna()]

#print(f"\nRemote type distribution (excluding not specified):")
#print(f"Total jobs with remote info: {len(df_remote):,}")

# Count jobs by remote type
remote_counts = df_remote['REMOTE_TYPE_NAME'].value_counts()
remote_df = pd.DataFrame({
    'Remote Type': remote_counts.index,
    'Count': remote_counts.values,
    'Percentage': (remote_counts.values / remote_counts.sum() * 100).round(1)
})

#print(remote_df.to_string(index=False))

# Create bar chart with custom colors
remote_df.hvplot.bar(
    x='Remote Type',
    y='Count',
    title='Job Market Distribution: Remote, Hybrid & On-Site Opportunities (2024)',
    height=400,
    width=700,
    color='Remote Type',
    cmap=['#2196F3', '#E91E63', '#9B59B6', '#F44336'],
    hover_cols=['Percentage'],
    ylabel='Number of Job Postings',
    xlabel='Work Location Type',
    rot=0
).opts(yformatter='%.0f')

Consistent with the employment-type findings, the 2024 market reflects a strong post-COVID recovery, with companies shifting back toward stable, long-term hiring after years of uncertainty. The small share of part-time and mixed-hour roles suggests that businesses are prioritizing consistent workforce availability to meet rising operational demands. Overall, the trend indicates increased employer confidence and a return to pre-pandemic hiring patterns focused on full-time talent.

10.5 Geographic Analysis

10.5.1 Top States for Job Opportunities

Geographic distribution shows where job opportunities are concentrated.

# Get top 10 states by job postings 
df_states = df[df['STATE_NAME'].notna()].copy()

state_counts = df_states['STATE_NAME'].value_counts().head(10)
state_df = pd.DataFrame({
    'State': state_counts.index,
    'Job Postings': state_counts.values,
    'Percentage': (state_counts.values / len(df_states) * 100).round(2)
})

#print(f"\nStates with job postings : {len(df_states):,} jobs")

# Create bar chart
state_df.hvplot.bar(
    x='State',
    y='Job Postings',
    title='Top 10 States by Number of Job Postings (2024)',
    height=450,
    width=850,
    color='#9B59B6',
    hover_cols=['Percentage'],
    ylabel='Number of Job Postings',
    xlabel='State',
    rot=45
).opts(yformatter='%.0f')

The top 10 states for job postings in 2024 are led by Texas and California, which together dominate the job market with substantially higher opportunities than the rest. States like Virginia, Florida, New York, and Illinois follow, reflecting strong demand in both tech-heavy and fast-growing regional economies. Overall, job opportunities are concentrated in major economic hubs and states with strong corporate, technology, and government sectors.

10.6 Top Hiring Companies

10.6.1 Companies with Most Job Openings

Identifying top hiring companies helps job seekers target organizations with multiple opportunities.

# Get top 10 companies by job postings (remove None values)
df_companies = df[df['COMPANY_NAME'].notna()].copy()

company_counts = df_companies['COMPANY_NAME'].value_counts().head(10)
company_df = pd.DataFrame({
    'Company': company_counts.index,
    'Job Postings': company_counts.values
})

#print(f"\nCompanies with job postings : {len(df_companies):,} jobs")

# Create horizontal bar chart
company_df.hvplot.barh(
    x='Company',
    y='Job Postings',
    title='Top 10 Companies by Number of Job Postings (2024)',
    height=550,
    width=800,
    color='#E91E63',
    ylabel='',
    xlabel='Number of Job Postings',
    flip_yaxis=True
).opts(xformatter='%.0f')

The top 10 hiring companies in 2024 are led by Deloitte and Accenture, which show significantly higher job-posting volumes than other firms, reflecting strong demand for consulting, technology, and analytics roles. Companies like PwC, Insight Global, KPMG, and Lumen Technologies also appear prominently, indicating consistent hiring across both consulting and IT services. Overall, the hiring landscape is dominated by large professional services firms, showcasing continued growth in advisory, digital transformation, and data-driven business roles.

10.7 Job Posting Timeline

10.7.1 When Jobs Are Posted

Understanding posting patterns helps with strategic job search timing. Since job postings peaked in August and September 2024, job seekers should target this late-summer period for applications. Companies appear to increase hiring toward the end of Q3, meaning more openings, faster responses, and higher chances of landing interviews. Preparing resumes, portfolios, and applications ahead of this surge (during May–July) can give job seekers a strategic advantage when the market becomes most active.

# Print diagnostics
#print("\nAnalyzing POSTED column:")
#print(f"Sample values: {df['POSTED'].head(10)}")
#print(f"Data type: {df['POSTED'].dtype}")

# Convert POSTED to datetime
df_time = df.copy()
df_time['POSTED'] = pd.to_datetime(df_time['POSTED'], errors='coerce')

#print(f"\nAfter conversion:")
#print(f"Date range: {df_time['POSTED'].min()} to {df_time['POSTED'].max()}")
#print(f"Valid dates: {df_time['POSTED'].notna().sum():,} out of {len(df_time):,}")
#print(f"Missing dates: {df_time['POSTED'].isna().sum():,}")

# Filter valid dates
time_data = df_time[df_time['POSTED'].notna()].copy()

if len(time_data) > 0:
    # Extract year-month
    time_data['YearMonth'] = time_data['POSTED'].dt.to_period('M')
    
    # Group by month and count
    monthly_counts = time_data.groupby('YearMonth').size().reset_index(name='Job Postings')
    monthly_counts['Month'] = monthly_counts['YearMonth'].astype(str)
    
    # Sort by month
    monthly_counts = monthly_counts.sort_values('Month')
    
    #print(f"\nMonthly posting counts:")
    #print(monthly_counts[['Month', 'Job Postings']].to_string(index=False))
    
    # Create line and area chart
    line_chart = monthly_counts.hvplot.line(
        x='Month',
        y='Job Postings',
        title='Job Posting Trends Over Time (2024)',
        height=400,
        width=850,
        color='#2196F3',
        line_width=3,
        ylabel='Number of Job Postings',
        xlabel='Month',
        rot=45
    ).opts(yformatter='%.0f')
    
    area_chart = monthly_counts.hvplot.area(
        x='Month',
        y='Job Postings',
        alpha=0.3,
        color='#2196F3'
    )
    
    # Combine line and area
    (line_chart * area_chart)
else:
    # No valid dates found; show raw samples to diagnose the POSTED format
    print("Warning: No valid dates found in POSTED column")
    print(df['POSTED'].head(20))

10.8 Experience Requirements

10.8.1 Minimum Years of Experience Required

Understanding experience requirements helps assess job market accessibility.

# Analyze minimum years of experience (remove None values)
df_experience = df[df['MIN_YEARS_EXPERIENCE'].notna()].copy()

exp_counts = df_experience['MIN_YEARS_EXPERIENCE'].value_counts().sort_index()
exp_df = pd.DataFrame({
    'Years of Experience': exp_counts.index,
    'Job Postings': exp_counts.values
})

# Convert Years of Experience to string for better labels
exp_df['Years Label'] = exp_df['Years of Experience'].astype(int).astype(str) + ' years'

#print(f"\nJobs with experience requirements (excluding None): {len(df_experience):,} jobs")
#print(exp_df[['Years of Experience', 'Job Postings']].to_string(index=False))

# Create bar chart with pastel purple color
exp_df.hvplot.bar(
    x='Years Label',
    y='Job Postings',
    title='Job Postings by Minimum Years of Experience Required (2024)',
    height=400,
    width=800,
    color='#B39DDB',
    ylabel='Number of Job Postings',
    xlabel='Minimum Years of Experience',
    rot=45
).opts(yformatter='%.0f')

Most job postings in 2024 require around 5 years of experience, making it the dominant requirement across roles. Entry-level opportunities (0–2 years) exist but are significantly fewer, showing that employers prioritize mid-level talent with proven skills. Requirements beyond 7+ years drop sharply, indicating limited demand for highly senior roles compared to mid-level positions.
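The entry/mid/senior pattern described above can be summarized by bucketing the minimum-experience values into bands, as sketched below. The bucket edges and toy values are our own illustration, not Lightcast categories; the real analysis would apply the same `pd.cut` to the `MIN_YEARS_EXPERIENCE` column.

```python
import pandas as pd

# Toy MIN_YEARS_EXPERIENCE values standing in for the cleaned column
min_years = pd.Series([0, 1, 2, 3, 5, 5, 5, 7, 10])

# Bin minimum experience into entry / mid / senior bands
buckets = pd.cut(
    min_years,
    bins=[-1, 2, 6, 100],
    labels=["Entry (0-2 yrs)", "Mid (3-6 yrs)", "Senior (7+ yrs)"],
)

# Share of postings in each band, listed in band order
shares = buckets.value_counts(normalize=True).sort_index().round(2)
print(shares)
```

On the toy data the mid band holds the largest share, mirroring the report's observation that employers concentrate demand on mid-level talent.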

10.9 References

  • Data source: Lightcast Job Postings Dataset (2024)
  • Visualization tools: hvPlot, Panel, Python
  • Analysis framework: Standard EDA best practices

11 Data Analysis Methods

11.1 Introduction

This document provides a comprehensive analysis of job market data for 2024. We perform data cleaning, handle missing values, remove duplicates, and conduct exploratory data analysis to understand hiring trends, salary distributions, and job market dynamics.

import pandas as pd
import numpy as np
import hvplot.pandas
import holoviews as hv
import warnings
warnings.filterwarnings('ignore')

# Initialize hvplot with bokeh backend
hv.extension('bokeh')

# Load your data
df = pd.read_csv('lightcast_job_postings.csv')

11.2 Step 1: Removing Redundant Columns

11.2.1 Why Remove These Columns?

We remove redundant columns to improve dataset quality and analysis efficiency. Specifically:

  • Tracking & Administrative Columns: ID, URL, DUPLICATES, LAST_UPDATED_TIMESTAMP - These are metadata and don’t contribute to analysis
  • Raw Text Fields: BODY, TITLE_RAW, COMPANY_RAW - We have cleaned versions (TITLE, COMPANY_NAME, TITLE_CLEAN)
  • Deprecated Classifications: We keep only the latest NAICS_2022_6, SOC_2021_4, and CIP4 standards
  • Duplicate Geographic Fields: Multiple versions of county/MSA data create redundancy

# List of columns to drop
columns_to_drop = [
    # Administrative & tracking columns
    "ID", "URL", "ACTIVE_URLS", "DUPLICATES", "LAST_UPDATED_TIMESTAMP",
    # Raw text columns (we have cleaned versions)
    "BODY", "TITLE_RAW", "COMPANY_RAW", "ACTIVE_SOURCES_INFO",
    # Deprecated NAICS versions (keeping NAICS_2022_6)
    "NAICS2", "NAICS2_NAME", "NAICS3", "NAICS3_NAME", 
    "NAICS4", "NAICS4_NAME", "NAICS5", "NAICS5_NAME", 
    "NAICS6", "NAICS6_NAME",
    # Deprecated SOC versions (keeping SOC_2021_4)
    "SOC_2", "SOC_2_NAME", "SOC_3", "SOC_3_NAME", 
    "SOC_4", "SOC_4_NAME", "SOC_5", "SOC_5_NAME",
    "SOC_2021_2", "SOC_2021_2_NAME", "SOC_2021_3", "SOC_2021_3_NAME", 
    "SOC_2021_5", "SOC_2021_5_NAME",
    # Deprecated occupation classifications
    "LOT_CAREER_AREA", "LOT_CAREER_AREA_NAME", "LOT_OCCUPATION", "LOT_OCCUPATION_NAME",
    "LOT_SPECIALIZED_OCCUPATION", "LOT_SPECIALIZED_OCCUPATION_NAME", 
    "LOT_OCCUPATION_GROUP", "LOT_OCCUPATION_GROUP_NAME",
    "LOT_V6_SPECIALIZED_OCCUPATION", "LOT_V6_SPECIALIZED_OCCUPATION_NAME", 
    "LOT_V6_OCCUPATION", "LOT_V6_OCCUPATION_NAME",
    "LOT_V6_OCCUPATION_GROUP", "LOT_V6_OCCUPATION_GROUP_NAME", 
    "LOT_V6_CAREER_AREA", "LOT_V6_CAREER_AREA_NAME",
    # Deprecated CIP and ONET versions
    "ONET_2019", "ONET_2019_NAME", "CIP6", "CIP6_NAME", "CIP2", "CIP2_NAME",
    # Duplicate geographic fields
    "COUNTY", "COUNTY_NAME", "COUNTY_OUTGOING", "COUNTY_NAME_OUTGOING", 
    "COUNTY_INCOMING", "COUNTY_NAME_INCOMING",
    "MSA", "MSA_OUTGOING", "MSA_INCOMING",
    # Deprecated salary fields (keeping SALARY)
    "SALARY_TO", "SALARY_FROM", "ORIGINAL_PAY_PERIOD",
    # Model versions (keep actual data)
    "MODELED_EXPIRED", "MODELED_DURATION"
]

# Drop only columns that exist in the dataset
columns_to_drop = [col for col in columns_to_drop if col in df.columns]
df_cleaned = df.drop(columns=columns_to_drop, inplace=False)

11.3 Step 2: Handling Missing Values

11.3.1 Understanding Missing Data

Before imputation, let’s visualize where data is missing:

# Missing value statistics
missing_stats = pd.DataFrame({
    'Column': df_cleaned.columns,
    'Missing_Count': df_cleaned.isnull().sum().values,
    'Missing_Percentage': (df_cleaned.isnull().sum().values / len(df_cleaned) * 100).round(2)
})
missing_stats = missing_stats[missing_stats['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

# Interactive missing values visualization
if len(missing_stats) > 0:
    # Select top 10 for cleaner visualization
    top_missing = missing_stats.head(10)
    missing_plot = top_missing.hvplot.barh(
        x='Column', 
        y='Missing_Percentage',
        title='Top 10 Columns with Missing Values',
        xlabel='Missing Percentage (%)',
        ylabel='Column Name',
        height=450,
        width=900,
        color='#e74c3c',
        hover_cols=['Missing_Count', 'Missing_Percentage']
    ).opts(invert_yaxis=True)
    missing_plot

11.3.2 Missing Value Imputation Strategy

We applied a strategic approach to handle missing data:

  • Dropped columns with more than 50% missing values
  • Filled numerical columns with median values to maintain distribution
  • Filled categorical columns with “Unknown” for clarity

# Strategy: Drop columns with >50% missing values
threshold = len(df_cleaned) * 0.5
cols_before = len(df_cleaned.columns)
df_cleaned = df_cleaned.dropna(thresh=threshold, axis=1)
cols_after = len(df_cleaned.columns)

# Fill numerical columns with median
numerical_cols = df_cleaned.select_dtypes(include=[np.number]).columns.tolist()
for col in numerical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        median_val = df_cleaned[col].median()
        df_cleaned[col] = df_cleaned[col].fillna(median_val)  # assign back; chained inplace fillna is deprecated in pandas 2.x

# Fill categorical columns with "Unknown"
categorical_cols = df_cleaned.select_dtypes(include=['object']).columns.tolist()
for col in categorical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        df_cleaned[col] = df_cleaned[col].fillna("Unknown")  # assign back; chained inplace fillna is deprecated in pandas 2.x
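The per-column loops above can also be collapsed into vectorized `fillna` calls, which sidesteps pandas' chained-`inplace` warning entirely. A sketch on a toy frame (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in one numeric and one categorical column (illustrative names)
toy = pd.DataFrame({
    "SALARY": [50000.0, np.nan, 70000.0, np.nan],
    "COMPANY_NAME": ["Acme", None, "Acme", "Globex"],
})

# Vectorized equivalent of the numeric loop: fillna accepts a per-column
# Series of medians, so one call covers every numeric column at once
num_cols = toy.select_dtypes(include=[np.number]).columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Vectorized equivalent of the categorical loop
cat_cols = toy.select_dtypes(include=["object"]).columns
toy[cat_cols] = toy[cat_cols].fillna("Unknown")
```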

11.4 Step 3: Removing Duplicates

11.4.1 Why Remove Duplicates?

Duplicate job postings skew our analysis. We remove them based on job title, company, and location to ensure each unique position is counted only once.

# Remove duplicates based on key identifiers
df_cleaned = df_cleaned.drop_duplicates(
    subset=['TITLE_NAME', 'COMPANY_NAME', 'LOCATION'], 
    keep='first'
)
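To quantify what this step removes, `duplicated` with the same key columns counts the repeats before they are dropped. A sketch with made-up postings:

```python
import pandas as pd

# Made-up postings: rows 0 and 2 describe the same job at the same office
toy = pd.DataFrame({
    "TITLE_NAME":   ["Data Analyst", "Data Analyst", "Data Analyst"],
    "COMPANY_NAME": ["Acme", "Acme", "Acme"],
    "LOCATION":     ["Boston, MA", "Austin, TX", "Boston, MA"],
})

keys = ["TITLE_NAME", "COMPANY_NAME", "LOCATION"]
# duplicated() flags every occurrence after the first for each key combination
n_dupes = toy.duplicated(subset=keys).sum()
deduped = toy.drop_duplicates(subset=keys, keep="first")
print(f"{n_dupes} duplicate removed, {len(deduped)} unique postings remain")
```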

11.5 Step 4: Exploratory Data Analysis (EDA)

11.5.1 Visualization 1: Top Job Titles

Understanding which job titles are most in demand helps job seekers identify the skills and roles to target.

# Get top 10 job titles
title_counts = df_cleaned['TITLE_NAME'].value_counts().head(10)
title_df = pd.DataFrame({'Job Title': title_counts.index, 'Count': title_counts.values})

# Create interactive plot
title_plot = title_df.hvplot.barh(
    x='Job Title',
    y='Count',
    title='Top 10 Job Titles by Postings (2024)',
    xlabel='Number of Job Postings',
    ylabel='Job Title',
    height=500,
    width=1000,
    color='#3498db',
    hover_cols='all'
).opts(invert_yaxis=True)

title_plot
Note: Key Insight

The top job titles show where the job market is most active. This data helps identify roles with highest demand for new talent.

11.5.2 Visualization 2: Job Postings by Employment Type

Employment type indicates the nature of positions available in the job market. Understanding the distribution helps job seekers target positions that match their preferences.

# Count jobs by employment type
employment_counts = df_cleaned['EMPLOYMENT_TYPE_NAME'].value_counts()
employment_df = pd.DataFrame({
    'Employment Type': employment_counts.index, 
    'Count': employment_counts.values,
    'Percentage': (employment_counts.values / employment_counts.sum() * 100).round(1)
})

# Create interactive scatter plot with size
employment_plot = employment_df.hvplot.scatter(
    x='Employment Type',
    y='Percentage',
    size='Count',
    title='Job Market Distribution by Employment Type (2024)',
    xlabel='Employment Type',
    ylabel='Percentage of Total Jobs (%)',
    height=500,
    width=1000,
    color='#e74c3c',
    hover_cols=['Count', 'Percentage'],
    alpha=0.6
)

employment_plot
Note: Key Insight

The distribution of employment types shows whether the market is primarily offering full-time, part-time, or contract positions, helping job seekers identify opportunities that match their work preferences.

11.5.3 Visualization 3: Remote Work Distribution

The prevalence of remote work reflects post-pandemic hiring trends and work flexibility.

# Count jobs by remote type
remote_counts = df_cleaned['REMOTE_TYPE_NAME'].value_counts()
remote_df = pd.DataFrame({
    'Remote Type': remote_counts.index,
    'Count': remote_counts.values,
    'Percentage': (remote_counts.values / remote_counts.sum() * 100).round(1)
})

# Create interactive bar chart
remote_plot = remote_df.hvplot.bar(
    x='Remote Type',
    y='Count',
    title='Job Market Distribution: Remote vs. On-Site (2024)',
    xlabel='Remote Type',
    ylabel='Number of Job Postings',
    height=500,
    width=900,
    color='#2E8B57',
    hover_cols=['Count', 'Percentage']
)

remote_plot
Note: Key Insight

The balance between remote and on-site jobs shows employers’ flexibility preferences. Higher remote job percentages indicate industry acceptance of distributed workforces.

11.5.4 Visualization 4: Top Companies Hiring

Understanding which companies are actively hiring helps job seekers identify potential employers with multiple open positions.

# Get top 10 companies by job postings
company_counts = df_cleaned['COMPANY_NAME'].value_counts().head(10)
company_df = pd.DataFrame({
    'Company': company_counts.index,
    'Job Postings': company_counts.values
})

# Create area chart
company_plot = company_df.hvplot.area(
    x='Company',
    y='Job Postings',
    title='Top 10 Companies by Number of Job Postings (2024)',
    xlabel='Company Name',
    ylabel='Number of Job Postings',
    height=500,
    width=1000,
    color='#9b59b6',
    line_width=2,
    alpha=0.5,
    hover_cols='all'
).opts(xrotation=45)

company_plot
Note: Key Insight

Companies with the highest number of job postings indicate organizations that are actively expanding their workforce and may offer more opportunities for candidates.

11.6 Summary of Data Cleaning & Analysis

Important: Summary

The data cleaning process successfully prepared the job market dataset for analysis. We removed redundant administrative and deprecated classification columns, handled missing values through strategic imputation, and eliminated duplicate job postings.

The exploratory analysis revealed key insights into job market trends, including:

  • High-demand roles: Identification of the most sought-after job titles in 2024
  • Employment stability: Understanding the distribution of full-time, part-time, and contract positions
  • Workplace flexibility: Analysis of remote, hybrid, and on-site work opportunities
  • Salary trends: Clear understanding of compensation ranges across different job categories

These insights provide valuable guidance for job seekers, employers, and market analysts in understanding the current state of the 2024 job market.
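Condensed into one helper, the cleaning pipeline summarized above looks roughly like this. This is a sketch: the 50% threshold and key columns follow this report, but the helper and toy data are illustrative rather than the exact code used:

```python
import numpy as np
import pandas as pd

def clean_postings(df, drop_cols=(), key_cols=("TITLE_NAME", "COMPANY_NAME", "LOCATION")):
    """Sketch of the cleaning steps above: drop listed columns, drop
    mostly-empty columns, impute, then de-duplicate on key fields."""
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])
    df = df.dropna(thresh=len(df) * 0.5, axis=1)       # keep columns >=50% populated
    for col in df.select_dtypes(include=[np.number]):
        df[col] = df[col].fillna(df[col].median())     # median preserves the distribution
    for col in df.select_dtypes(include=["object"]):
        df[col] = df[col].fillna("Unknown")
    return df.drop_duplicates(subset=[c for c in key_cols if c in df.columns], keep="first")

# Toy check: an all-empty column, a salary gap, and one duplicated posting
toy = pd.DataFrame({
    "TITLE_NAME": ["Data Analyst", "Data Analyst"],
    "COMPANY_NAME": ["Acme", "Acme"],
    "LOCATION": ["Boston, MA", "Boston, MA"],
    "SALARY": [90000.0, np.nan],
    "MOSTLY_EMPTY": [np.nan, np.nan],
})
cleaned = clean_postings(toy)
```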

11.7 References

All data sourced from Lightcast Job Postings Dataset (2024). Analysis performed using Python with pandas, hvplot, and holoviews libraries.

12 Natural Language Processing Analysis


---
title: "Natural Language Processing Methods"
subtitle: "Extracting Insights from Job Descriptions"
format:
  html:
    toc: true
    number-sections: true
    df-print: paged
    code-overflow: wrap
    embed-resources: true
    css: styles.css
jupyter: python3
execute:
  echo: false
  warning: false
  message: false
---

12.1 Introduction

This section applies Natural Language Processing (NLP) techniques to analyze job postings and extract meaningful insights about skills, requirements, and industry trends. By processing structured skills data, we can uncover patterns about the most in-demand competencies.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re
import warnings
warnings.filterwarnings('ignore')

# Set seaborn style
sns.set_style("whitegrid")
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (8, 5)

# Load data
df = pd.read_csv('cleanedjob_postings.csv')

print(f"Dataset loaded: {df.shape[0]:,} rows, {df.shape[1]} columns")

# Identify skills columns
skills_columns = [col for col in df.columns if 'SKILLS_NAME' in col or 'SKILLS' in col]
print(f"Skills columns found: {skills_columns}")
Dataset loaded: 59,220 rows, 56 columns
Skills columns found: ['SKILLS', 'SKILLS_NAME', 'SPECIALIZED_SKILLS', 'SPECIALIZED_SKILLS_NAME', 'COMMON_SKILLS', 'COMMON_SKILLS_NAME', 'SOFTWARE_SKILLS', 'SOFTWARE_SKILLS_NAME']

12.2 1. Data Preparation

12.2.1 1.1 Skills Data Overview

# Combine all skills into a single text field for analysis
def combine_skills(row):
    """Combine all skills columns into a single text"""
    skills = []
    for col in ['SKILLS_NAME', 'SPECIALIZED_SKILLS_NAME', 'COMMON_SKILLS_NAME', 'SOFTWARE_SKILLS_NAME']:
        if col in df.columns and pd.notna(row.get(col)):
            skills.append(str(row[col]))
    return ' | '.join(skills)

df['all_skills'] = df.apply(combine_skills, axis=1)
df_skills = df[df['all_skills'].str.len() > 0].copy()

print(f"✓ Successfully processed: {len(df_skills):,} job postings with skills")
print(f"✓ Average skills text length: {df_skills['all_skills'].str.len().mean():.0f} characters")
✓ Successfully processed: 59,220 job postings with skills
✓ Average skills text length: 1204 characters

12.3 2. Keyword Extraction

12.3.1 2.1 Top 10 Most Common Skills

# Extract most common skill keywords (excluding the extreme outlier)
if len(df_skills) > 0:
    all_skills = []
    
    for skills_text in df_skills['all_skills']:
        # Split by pipe separator and clean thoroughly
        skills = [s.strip() for s in str(skills_text).split('|')]
        # Remove quotes, brackets, and clean each skill
        cleaned_skills = []
        for skill in skills:
            clean = skill.replace('"', '').replace("'", "").replace('[', '').replace(']', '')
            clean = clean.replace('(', '').replace(')', '').strip().lower()
            if len(clean) > 2:  # Only keep skills with more than 2 characters
                cleaned_skills.append(clean)
        all_skills.extend(cleaned_skills)
    
    # Remove empty strings
    all_skills = [s for s in all_skills if len(s) > 0]
    
    if len(all_skills) > 0:
        skill_freq = Counter(all_skills)
        
        # Get top 11 skills, then exclude the top one if it's an extreme outlier
        top_skills_raw = skill_freq.most_common(11)
        
        # Check if the top skill is an extreme outlier (>10x the second)
        if len(top_skills_raw) > 1 and top_skills_raw[0][1] > 10 * top_skills_raw[1][1]:
            print(f"Note: Excluded extreme outlier '{top_skills_raw[0][0]}' with {top_skills_raw[0][1]:,} occurrences")
            top_skills = top_skills_raw[1:11]  # Skip the first, take next 10
        else:
            top_skills = top_skills_raw[:10]
        
        skills, counts = zip(*top_skills)
        
        # Create dataframe for seaborn
        skill_df = pd.DataFrame({'Skill': skills, 'Frequency': counts})
        skill_df = skill_df.sort_values('Frequency')
        
        # Clean up skill names - remove quotes and brackets, capitalize properly
        skill_df['Skill'] = skill_df['Skill'].str.replace('"', '').str.replace("'", "")
        skill_df['Skill'] = skill_df['Skill'].str.replace(r'[\[\]\(\)]', '', regex=True)
        skill_df['Skill'] = skill_df['Skill'].str.strip()
        
        # Split combined labels - if comma-separated, take first item only
        skill_df['Skill'] = skill_df['Skill'].apply(lambda x: x.split(',')[0].strip() if ',' in x else x)
        skill_df['Skill'] = skill_df['Skill'].str.title()
        
        fig, ax = plt.subplots(figsize=(8, 5))
        bars = sns.barplot(data=skill_df, y='Skill', x='Frequency', palette='Blues_r', ax=ax)
        ax.set_title('Top 10 Most Common Skills in Job Postings', fontsize=14, fontweight='bold', pad=15)
        ax.set_xlabel('Frequency', fontsize=11, fontweight='bold')
        ax.set_ylabel('Skill', fontsize=11, fontweight='bold')
        
        # Add value labels
        for i, v in enumerate(skill_df['Frequency']):
            ax.text(v + max(skill_df['Frequency'])*0.02, i, f'{int(v):,}', 
                   va='center', fontweight='bold', fontsize=9)
        
        # Improve layout
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        plt.tight_layout()
        plt.show()
    else:
        print("No skills found")

Top 10 Skills Insight:

SAP Applications dominates with 1,114 postings, reflecting the strong demand for enterprise resource planning expertise in large organizations. Oracle Cloud (667) and Microsoft Office (637) follow, showing that cloud platforms and productivity tools remain essential. Communication (532) appears as the top soft skill, emphasizing that technical roles require strong interpersonal abilities. The data reveals a mix of enterprise software (SAP, Oracle Cloud), business analysis capabilities (Data Analysis, Dashboard, Project Management), and emerging areas like Cyber Security and UX Design. Job seekers should prioritize SAP expertise if targeting enterprise roles, while building a foundation in Microsoft Office, data analysis, and communication skills for broader market appeal.

12.4 3. Technical Skills Analysis

12.4.1 3.1 Top 10 Software & Technical Skills

# Focus on software/technical skills column
if 'SOFTWARE_SKILLS_NAME' in df.columns:
    software_skills = []
    
    for skills_text in df['SOFTWARE_SKILLS_NAME'].dropna():
        skills = [s.strip() for s in str(skills_text).split('|')]
        # Clean each skill thoroughly
        for skill in skills:
            clean = skill.replace('"', '').replace("'", "").replace('[', '').replace(']', '')
            clean = clean.replace('(', '').replace(')', '').strip().lower()
            if len(clean) > 2:
                software_skills.append(clean)
    
    software_skills = [s for s in software_skills if len(s) > 0]
    
    if len(software_skills) > 0:
        software_freq = Counter(software_skills)
        
        # Get top 11, check for outlier
        top_software_raw = software_freq.most_common(11)
        
        if len(top_software_raw) > 1 and top_software_raw[0][1] > 10 * top_software_raw[1][1]:
            print(f"Note: Excluded extreme outlier '{top_software_raw[0][0]}' with {top_software_raw[0][1]:,} occurrences")
            top_software = top_software_raw[1:11]
        else:
            top_software = top_software_raw[:10]
        
        software, counts = zip(*top_software)
        
        software_df = pd.DataFrame({'Software': software, 'Count': counts})
        software_df = software_df.sort_values('Count')
        
        # Clean software names - remove quotes, brackets, parentheses
        software_df['Software'] = software_df['Software'].str.replace('"', '').str.replace("'", "")
        software_df['Software'] = software_df['Software'].str.replace(r'[\[\]\(\)]', '', regex=True)
        software_df['Software'] = software_df['Software'].str.strip()
        
        # Split combined labels - if comma-separated, take first item only
        software_df['Software'] = software_df['Software'].apply(lambda x: x.split(',')[0].strip() if ',' in x else x)
        software_df['Software'] = software_df['Software'].str.title()
        
        fig, ax = plt.subplots(figsize=(8, 5))
        sns.barplot(data=software_df, y='Software', x='Count', palette='viridis_r', ax=ax)
        ax.set_title('Top 10 Software Skills in Job Postings', fontsize=14, fontweight='bold', pad=15)
        ax.set_xlabel('Number of Job Postings', fontsize=11, fontweight='bold')
        ax.set_ylabel('Software Skill', fontsize=11, fontweight='bold')
        
        # Add value labels at the end of each bar
        max_val = software_df['Count'].max()
        for i, v in enumerate(software_df['Count']):
            ax.text(v + max_val*0.01, i, f'{int(v):,}', 
                   va='center', fontweight='bold', fontsize=9)
        
        # Set x-axis limit to prevent cutoff
        ax.set_xlim(0, max_val * 1.15)
        
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        plt.tight_layout()
        plt.show()

Top 10 Software Skills - Job Seeker Insights:

SAP Applications leads dramatically with 934 postings, making it the single most valuable software skill for enterprise-focused careers. Oracle Cloud (633) and Microsoft Office (602) demonstrate the importance of cloud infrastructure and productivity suites. Dashboard skills (433) reflect the growing need for data visualization capabilities across all roles. SQL Programming Language (353) and Microsoft Excel (282) remain fundamental for data manipulation and analysis. The presence of specialized tools like Anaplan (196), Onestream CPM Software (196), and Oracle E-Business Suite (173) indicates niche opportunities in financial planning and enterprise systems. Job seekers should master SAP for enterprise roles, Excel and SQL for data work, and consider specializing in emerging tools like Anaplan for competitive advantage.

12.5 4. Specialized Skills Analysis

12.5.1 4.1 Top 10 Specialized Skills

# Focus on specialized skills column
if 'SPECIALIZED_SKILLS_NAME' in df.columns:
    specialized_skills = []
    
    for skills_text in df['SPECIALIZED_SKILLS_NAME'].dropna():
        skills = [s.strip() for s in str(skills_text).split('|')]
        # Clean each skill thoroughly
        for skill in skills:
            clean = skill.replace('"', '').replace("'", "").replace('[', '').replace(']', '')
            clean = clean.replace('(', '').replace(')', '').strip().lower()
            if len(clean) > 2:
                specialized_skills.append(clean)
    
    specialized_skills = [s for s in specialized_skills if len(s) > 0]
    
    if len(specialized_skills) > 0:
        spec_freq = Counter(specialized_skills)
        top_spec = spec_freq.most_common(10)
        
        spec, counts = zip(*top_spec)
        
        spec_df = pd.DataFrame({'Skill': spec, 'Count': counts})
        spec_df = spec_df.sort_values('Count')
        
        # Clean skill names - remove quotes, brackets, parentheses
        spec_df['Skill'] = spec_df['Skill'].str.replace('"', '').str.replace("'", "")
        spec_df['Skill'] = spec_df['Skill'].str.replace(r'[\[\]\(\)]', '', regex=True)
        spec_df['Skill'] = spec_df['Skill'].str.strip()
        
        # Split combined labels - if comma-separated, take first item only
        spec_df['Skill'] = spec_df['Skill'].apply(lambda x: x.split(',')[0].strip() if ',' in x else x)
        
        # Truncate long skill names to prevent overlap
        spec_df['Skill'] = spec_df['Skill'].apply(lambda x: x[:35] + '...' if len(x) > 35 else x)
        spec_df['Skill'] = spec_df['Skill'].str.title()
        
        fig, ax = plt.subplots(figsize=(8, 6))
        sns.barplot(data=spec_df, y='Skill', x='Count', palette='rocket_r', ax=ax)
        ax.set_title('Top 10 Specialized Skills in Job Postings', fontsize=14, fontweight='bold', pad=15)
        ax.set_xlabel('Number of Job Postings', fontsize=11, fontweight='bold')
        ax.set_ylabel('Specialized Skill', fontsize=11, fontweight='bold')
        
        # Add value labels
        max_val = spec_df['Count'].max()
        for i, v in enumerate(spec_df['Count']):
            ax.text(v + max_val*0.01, i, f'{int(v):,}', 
                   va='center', fontweight='bold', fontsize=9)
        
        # Set x-axis limit
        ax.set_xlim(0, max_val * 1.15)
        
        # Adjust y-axis labels to prevent overlap
        ax.tick_params(axis='y', labelsize=10)
        plt.yticks(rotation=0)
        
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        plt.tight_layout()
        plt.show()

Top 10 Specialized Skills - Job Seeker Insights:

User Experience (UX) Design leads with 260 postings, highlighting the critical importance of user-centered design in modern product development. Data Analysis (220) and Cloud Computing (196) show strong demand for analytical and cloud architecture expertise. Emergency Response (143) and Pivot Tables and Charts (125) represent specialized operational and analytical capabilities. SAP Applications (104) appears again, reinforcing its enterprise value. Microsoft Access (89), Databricks (85), Mulesoft (85), and Sales Process (74) round out the list, showing diverse specializations from database management to integration platforms and CRM. Job seekers should prioritize UX design for product roles, data analysis for analytical positions, and cloud computing for infrastructure careers, while considering niche specializations like Databricks or Mulesoft for premium positioning.

12.6 5. Skill Requirements by Job Title

12.6.1 5.1 Skills vs Top Job Titles - Heatmap

# Cross-tabulate skills by top job titles
if 'TITLE_NAME' in df.columns and 'SOFTWARE_SKILLS_NAME' in df.columns:
    print("Creating heatmap...")
    
    # Get top 8 job titles
    top_titles = df['TITLE_NAME'].value_counts().head(8).index
    print(f"Top job titles: {list(top_titles)}")
    
    # Get top 8 software skills (excluding extreme outlier and empty strings)
    all_software = []
    for skills_text in df['SOFTWARE_SKILLS_NAME'].dropna():
        skills = [s.strip() for s in str(skills_text).split('|')]
        for skill in skills:
            # Clean thoroughly
            clean = skill.replace('"', '').replace("'", "").replace('[', '').replace(']', '')
            clean = clean.replace('(', '').replace(')', '').strip().lower()
            if len(clean) > 2:
                all_software.append(clean)
    
    software_freq = Counter(all_software)
    top_software_list = software_freq.most_common(15)
    
    # Filter out problematic entries
    filtered_software = []
    for skill, count in top_software_list:
        # Skip if contains 'cpm', 'onestream', or is too short
        if 'cpm' not in skill.lower() and 'onestream' not in skill.lower() and len(skill) > 2:
            filtered_software.append((skill, count))
    
    # Check for extreme outlier
    if len(filtered_software) > 1 and filtered_software[0][1] > 10 * filtered_software[1][1]:
        top_8_software = [s[0] for s in filtered_software[1:9]]
    else:
        top_8_software = [s[0] for s in filtered_software[:8]]
    
    print(f"Top software skills: {top_8_software}")
    
    # Build matrix
    skill_title_matrix = []
    
    for title in top_titles:
        title_data = df[df['TITLE_NAME'] == title].copy()
        skill_row = {'Title': title}
        
        for skill in top_8_software:
            count = 0
            for skills_text in title_data['SOFTWARE_SKILLS_NAME'].dropna():
                if skill in str(skills_text).lower():
                    count += 1
            # Clean skill name for column header - remove all special characters
            clean_skill = skill.replace('"', '').replace("'", "").replace('[', '').replace(']', '')
            clean_skill = clean_skill.replace('(', '').replace(')', '').replace(',', '').strip()
            # Take only first part if multiple words separated by comma
            if ',' in clean_skill or len(clean_skill) > 20:
                clean_skill = clean_skill.split(',')[0].split()[0:3]  # First 3 words max
                clean_skill = ' '.join(clean_skill)
            clean_skill = clean_skill.title()[:20]  # Max 20 chars
            skill_row[clean_skill] = (count / len(title_data) * 100) if len(title_data) > 0 else 0
        
        skill_title_matrix.append(skill_row)
    
    if skill_title_matrix:
        print(f"Matrix created with {len(skill_title_matrix)} rows")
        skill_matrix_df = pd.DataFrame(skill_title_matrix)
        skill_matrix_df = skill_matrix_df.set_index('Title')
        
        print(f"Dataframe shape: {skill_matrix_df.shape}")
        print(f"Columns: {list(skill_matrix_df.columns)}")
        
        # Shorten job titles if too long
        skill_matrix_df.index = [title[:25] + '...' if len(title) > 25 else title 
                                  for title in skill_matrix_df.index]
        
        fig, ax = plt.subplots(figsize=(8, 4))
        sns.heatmap(skill_matrix_df, annot=True, fmt='.1f', cmap='YlOrRd', 
                    cbar_kws={'label': '% of Postings'}, 
                    ax=ax, linewidths=0.5, linecolor='white', annot_kws={'size': 7.5})
        ax.set_title('Software Skill Requirements by Job Title', 
                     fontsize=12, fontweight='bold', pad=10)
        ax.set_xlabel('Software Skill', fontsize=10, fontweight='bold')
        ax.set_ylabel('Job Title', fontsize=10, fontweight='bold')
        plt.xticks(rotation=45, ha='right', fontsize=8)
        plt.yticks(rotation=0, fontsize=8)
        
        plt.tight_layout()
        plt.show()
        print("Heatmap displayed successfully!")
    else:
        print("ERROR: No skill matrix data created")
else:
    print("ERROR: Required columns not found")
Creating heatmap...
Top job titles: ['Data Analysts', 'Unclassified', 'Business Intelligence Analysts', 'Enterprise Architects', 'Data Modelers', 'Data Governance Analysts', 'Oracle Cloud HCM Consultants', 'Solutions Architects']
Top software skills: ['sap applications', 'oracle cloud', 'microsoft office', 'dashboard', 'sql programming language', 'microsoft excel', 'microsoft powerpoint,\n  microsoft excel', 'oracle e-business suite']
Matrix created with 8 rows
Dataframe shape: (8, 8)
Columns: ['Sap Applications', 'Oracle Cloud', 'Microsoft Office', 'Dashboard', 'Sql Programming Lang', 'Microsoft Excel', 'Microsoft Powerpoint', 'Oracle E-Business Su']

Heatmap displayed successfully!

Skills vs Job Titles - Job Seeker Insights:

The heatmap reveals distinct skill patterns across job roles. Unclassified positions show the highest demand for general software skills (37.2% require SAP Applications), suggesting broad technical requirements. Oracle Cloud HCM Consultants have specialized focus with 11.1% requiring Oracle Cloud expertise. Data Governance Analysts need the most diverse skillset with strong SQL Programming Language (13.8%) and moderate requirements across Dashboard, Microsoft Office, and SAP. Enterprise Architects and Solutions Architects show concentrated needs in SAP (10.5% and 8.5% respectively) with minimal other software requirements. Data Modelers uniquely emphasize SQL Programming (2.1%) over other tools. The low percentages overall indicate that most roles don’t mandate specific software, creating opportunities for candidates to differentiate through technical mastery. Job seekers should target SAP for enterprise roles, Oracle Cloud for HCM consulting, and SQL for data-focused positions.

12.7 Summary

Important: NLP Analysis Key Findings

Skills Analysis Results:

  • ✓ Successfully analyzed skills from structured data columns
  • ✓ Identified top 10 most common skills across all job postings
  • ✓ Extracted and ranked top 10 software/technical skills
  • ✓ Analyzed top 10 specialized skills for advanced roles
  • ✓ Cross-referenced skills with job title requirements via heatmap

Key Insights:

  • Specific skills dominate the job market across different categories
  • Software skills are critical technical differentiators
  • Specialized skills offer opportunities for premium positioning
  • Skills requirements vary significantly by job title

Recommendations for Job Seekers:

  • Prioritize learning the top 10 skills identified in each category
  • Develop proficiency in the most demanded software tools
  • Build specialized skills for senior or expert-level opportunities
  • Focus on developing complementary skills that frequently co-occur
  • Use the heatmap to understand skill priorities for your target job titles
  • Tailor your resume to highlight relevant skill combinations

12.8 References

  • Pandas Documentation: https://pandas.pydata.org/
  • Seaborn Documentation: https://seaborn.pydata.org/
  • Collections Counter: https://docs.python.org/3/library/collections.html#collections.Counter

13 Skill Gap Analysis

13.1 Introduction

This analysis compares our team’s current skill levels against the skills demanded in the IT job market. By identifying gaps between our capabilities and market requirements, we can develop targeted learning strategies to enhance our competitiveness as job candidates.

import pandas as pd
import numpy as np
import hvplot.pandas
import panel as pn
from collections import Counter
import re
import warnings
warnings.filterwarnings('ignore')

# Enable panel for rendering
pn.extension()

# Color palette
COLORS = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12', '#9b59b6', '#1abc9c']

# Load cleaned job data
df = pd.read_csv('cleanedjob_postings.csv')
print(f"Analyzing {len(df):,} job postings for skill requirements")
Analyzing 59,220 job postings for skill requirements

13.2 1. Team Skill Assessment

13.2.1 1.1 Creating Team Skills Profile

Each team member rates their proficiency in key IT skills on a scale of 1-5:

  • 1 = Beginner (aware of the skill)
  • 2 = Basic knowledge (can perform simple tasks)
  • 3 = Intermediate (comfortable with common scenarios)
  • 4 = Advanced (can handle complex problems)
  • 5 = Expert (can teach others and solve any problem)

# Team member skill assessments
# TODO: Replace with actual team member names and skill ratings
skills_data = {
    "Name": ["Tuba Anwar", "Kriti Singh", "Soham Deshkhaire"],
    "Python": [4, 3, 4],
    "SQL": [3, 4, 3],
    "Machine Learning": [3, 2, 4],
    "Data Analysis": [4, 4, 3],
    "Cloud Computing": [2, 2, 3],
    "Java": [3, 2, 2],
    "JavaScript": [2, 3, 3],
    "R": [3, 3, 2],
    "Tableau": [3, 4, 2],
    "Excel": [4, 5, 3],
    "AWS": [2, 1, 2],
    "Azure": [1, 2, 2],
    "Docker": [2, 1, 3],
    "Git": [3, 3, 4],
    "Power BI": [2, 3, 2],
    "Spark": [2, 2, 3],
    "TensorFlow": [2, 1, 3],
    "NLP": [3, 2, 3]
}

df_team_skills = pd.DataFrame(skills_data)
df_team_skills.set_index("Name", inplace=True)

# Display team skills
#print("Team Skills Profile:")
#print(df_team_skills)

# Calculate team averages
team_avg = df_team_skills.mean().sort_values(ascending=False)
#print(f"\nTeam Average Skills (sorted):")
#print(team_avg)

The team’s strongest skills are Excel, Python, and Data Analysis, showing solid readiness for analytical and data-focused roles. SQL, Git, Tableau, and Machine Learning are moderately strong, indicating reliable but improvable proficiency. Meanwhile, AWS, Azure, TensorFlow, Docker, and Cloud Computing show lower scores, highlighting clear opportunities for growth, especially given rising market demand for cloud and ML engineering skills. Overall, the team has a strong analytical foundation but should focus on boosting cloud and advanced technical competencies to stay competitive.
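A natural next step is to turn averages like these into a ranked gap score by putting proficiency and market demand on the same 0-1 scale. The numbers below are illustrative placeholders, and `gap_score` is a hypothetical helper rather than the scoring used later in this report:

```python
# Illustrative inputs: team averages on the 1-5 scale, and each skill's
# share of job postings (placeholder values, not this report's results)
team_proficiency = {"Python": 3.7, "SQL": 3.3, "AWS": 1.7}
market_demand = {"Python": 0.62, "SQL": 0.55, "AWS": 0.30}

def gap_score(skill):
    prof_norm = (team_proficiency[skill] - 1) / 4  # map the 1-5 scale onto 0-1
    return market_demand[skill] - prof_norm        # positive => market wants more than the team has

gaps = {skill: round(gap_score(skill), 2) for skill in team_proficiency}
# With these placeholders, AWS shows the largest gap: high demand, low proficiency
```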

13.2.2 1.2 Team Skills Heatmap

Visualize each team member’s strengths and weaknesses across all skills.

# Create heatmap using hvPlot
heatmap = df_team_skills.T.hvplot.heatmap(
    title='Team Skill Proficiency Heatmap',
    cmap='RdYlGn',
    height=700,
    width=800,
    xlabel='Team Member',
    ylabel='Skill',
    clabel='Proficiency Level',
    rot=0
)
heatmap

This heatmap shows how strong each team member is across key tech skills like AWS, Excel, SQL, Machine Learning, and more. The greener the box, the stronger the skill—red means the weakest. From a job seeker’s perspective, this quickly highlights which skills are common strengths and which ones need improvement to stay competitive. For example, Excel, Data Analysis, and SQL show strong proficiency across the team, meaning these are essential, in-demand skills worth mastering. Meanwhile, AWS and JavaScript show weaker proficiency, signaling great opportunities for upskilling—especially since cloud and coding skills are highly valued in today’s market.

13.2.3 1.3 Team Average Skills Bar Chart

Compare team average proficiency across all skills.

# Prepare data for bar chart
team_avg_df = team_avg.reset_index()
team_avg_df.columns = ['Skill', 'Average Proficiency']

# Create bar chart
chart = team_avg_df.sort_values('Average Proficiency', ascending=True).hvplot.barh(
    x='Skill',
    y='Average Proficiency',
    title='Team Average Skill Proficiency',
    height=700,
    width=900,
    color='#3498db',
    xlabel='Average Proficiency (1-5)',
    ylabel='',
    flip_yaxis=True
)
chart

The team’s top 10 skills show strong proficiency in high-value areas like Excel, Data Analysis, Python, Git, and SQL, which are core requirements for most analytics and tech roles. Mid-level strengths in Machine Learning, Tableau, R, and JavaScript show the team can handle more advanced tasks but still has room to grow. Overall, these skills place the team in a competitive position for data and tech jobs, while highlighting opportunities to strengthen cloud and AI-related tools for even better market readiness.

13.3 2. Market Skill Demand Analysis

13.3.1 2.1 Extracting Skills from Job Descriptions

We analyze job postings to identify the most in-demand skills in the IT job market.

# Define comprehensive skill keywords to search for
skill_keywords = {
    'Python': ['python', 'python3'],
    'SQL': ['sql', 'mysql', 'postgresql', 'sql server', 'oracle sql'],
    'Machine Learning': ['machine learning', 'ml', 'deep learning', 'neural network'],
    'Data Analysis': ['data analysis', 'data analytics', 'analytical'],
    'Cloud Computing': ['cloud', 'cloud computing', 'cloud services'],
    'Java': ['java', 'java8', 'java 8'],
    'JavaScript': ['javascript', 'js', 'node.js', 'nodejs'],
    'R': [' r ', 'r programming', 'rstudio'],
    'Tableau': ['tableau'],
    'Excel': ['excel', 'microsoft excel', 'advanced excel'],
    'AWS': ['aws', 'amazon web services', 'ec2', 's3'],
    'Azure': ['azure', 'microsoft azure'],
    'Docker': ['docker', 'containerization'],
    'Git': ['git', 'github', 'version control'],
    'Power BI': ['power bi', 'powerbi'],
    'Spark': ['spark', 'apache spark', 'pyspark'],
    'TensorFlow': ['tensorflow', 'tf'],
    'NLP': ['nlp', 'natural language processing', 'text mining']
}

# Function to extract skills from text.
# Whole-word matching avoids false positives such as 'ml' inside 'html'.
import re

def extract_skills(text):
    if pd.isna(text):
        return []
    text = str(text).lower()
    found_skills = []
    for skill, keywords in skill_keywords.items():
        if any(re.search(r'\b' + re.escape(kw.strip()) + r'\b', text) for kw in keywords):
            found_skills.append(skill)
    return found_skills

# Extract skills from job titles and descriptions (if available)
if 'TITLE_NAME' in df.columns:
    df['extracted_skills'] = df['TITLE_NAME'].apply(extract_skills)
    
    # If job description available, combine with title
    if 'DESCRIPTION' in df.columns or 'JOB_DESCRIPTION' in df.columns:
        desc_col = 'DESCRIPTION' if 'DESCRIPTION' in df.columns else 'JOB_DESCRIPTION'
        df['extracted_skills'] = df.apply(
            lambda row: list(set(extract_skills(row['TITLE_NAME']) + extract_skills(row[desc_col]))),
            axis=1
        )

# Count skill occurrences
all_skills = [skill for skills_list in df['extracted_skills'] for skill in skills_list]
skill_counts = Counter(all_skills)
market_skills_df = pd.DataFrame(skill_counts.items(), columns=['Skill', 'Job Postings'])
market_skills_df = market_skills_df.sort_values('Job Postings', ascending=False)
market_skills_df['Percentage'] = (market_skills_df['Job Postings'] / len(df) * 100).round(2)

#print("Market Skill Demand (Top Skills):")
#print(market_skills_df.head(10))

The skill extraction shows that Data Analysis, Cloud Computing, and TensorFlow are the most frequently requested skills in job postings, making them top priorities for job seekers. Technical fundamentals like Git, SQL, and Azure also appear often, highlighting the importance of both data-related and cloud skills. Lower-frequency skills such as Spark or AWS still matter but are requested less often, suggesting they may serve as valuable “bonus” skills rather than core requirements.
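The keyword lookup described above can be illustrated with a self-contained sketch; the trimmed keyword set and sample title below are hypothetical, and whole-word matching is used to avoid false positives such as 'py' inside 'happy':

```python
import re

# Trimmed, illustrative keyword set (hypothetical; mirrors the idea above).
sample_keywords = {
    'Python': ['python', 'python3'],
    'SQL': ['sql', 'mysql', 'postgresql'],
    'AWS': ['aws', 'amazon web services'],
}

def find_skills(text, keywords=sample_keywords):
    """Return skill labels whose keywords appear as whole words in text."""
    text = text.lower()
    found = []
    for skill, terms in keywords.items():
        # \b boundaries prevent 'py'-in-'happy' style false positives
        if any(re.search(r'\b' + re.escape(t) + r'\b', text) for t in terms):
            found.append(skill)
    return found

print(find_skills("Data Engineer with Python, PostgreSQL and AWS experience"))
# ['Python', 'SQL', 'AWS']
```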

13.3.2 2.2 Top In-Demand Skills Visualization

# Create bar chart of top 10 market skills
top_market_skills = market_skills_df.head(10).sort_values('Job Postings', ascending=True)

chart = top_market_skills.hvplot.barh(
    x='Skill',
    y='Job Postings',
    title='Top 10 Most In-Demand Skills in Job Market (2024)',
    height=600,
    width=900,
    color='#3498db',
    hover_cols=['Percentage'],
    xlabel='Number of Job Postings Requiring Skill',
    ylabel='',
    flip_yaxis=True
)
chart

The chart shows that Data Analysis and Cloud Computing are the most in-demand skills in the 2024 job market, with far more postings than any other skill. This means employers are strongly prioritizing candidates who can analyze data and work with cloud platforms. Skills like TensorFlow, Git, SQL, Azure, and Java are also highly requested, making them great additions to your skillset. Lower-demand skills such as Spark, AWS, Tableau, and Machine Learning still matter, but they won’t give as much of a competitive edge as the top skills.

13.4 3. Skill Gap Identification

13.4.1 3.1 Comparing Team Skills to Market Demand

Now we identify gaps between our team’s capabilities and what the market requires.

# Normalize market demand to 1-5 scale for comparison
# We'll scale based on percentage of jobs requiring each skill
max_percentage = market_skills_df['Percentage'].max()

# Create comparison dataframe
comparison_data = []

for skill in df_team_skills.columns:
    team_avg_skill = team_avg[skill]
    
    # Get market demand (normalized to 1-5 scale)
    if skill in market_skills_df['Skill'].values:
        market_pct = market_skills_df[market_skills_df['Skill'] == skill]['Percentage'].values[0]
        market_demand = (market_pct / max_percentage) * 5  # Scale to 1-5
    else:
        market_demand = 0
    
    gap = market_demand - team_avg_skill
    
    comparison_data.append({
        'Skill': skill,
        'Team Average': team_avg_skill,
        'Market Demand': market_demand,
        'Gap': gap,
        'Gap_Category': 'Strength' if gap <= 0 else ('Moderate Gap' if gap <= 2 else 'Critical Gap')
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.sort_values('Gap', ascending=False)

#print("Skill Gap Analysis:")
#print(comparison_df)

The analysis shows that our team is strong in most core skills, especially Python, SQL, Tableau, Machine Learning, and Git, all areas where market demand is relatively low to moderate. However, there are two skills with clear gaps compared to what employers want:

  • Cloud Computing (e.g., AWS, Azure)
  • Data Analysis

These areas are in very high demand, but our team’s proficiency is lower than what the job market expects. Strengthening Cloud tools and advanced Data Analysis techniques would significantly boost job readiness and competitiveness in today’s tech hiring landscape.
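The normalization behind these gap scores can be traced with hypothetical numbers:

```python
# Hypothetical numbers illustrating the 1-5 scaling used in the comparison:
# a skill's posting percentage is scaled relative to the most-demanded skill.
max_percentage = 40.0        # most-demanded skill appears in 40% of postings
market_pct = 10.0            # skill of interest appears in 10% of postings
team_avg_skill = 3.0         # team's average self-assessed proficiency (1-5)

market_demand = (market_pct / max_percentage) * 5
gap = market_demand - team_avg_skill

print(market_demand, gap)
# 1.25 -1.75  -> a negative gap, i.e. a team strength
```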

13.4.2 3.2 Skill Gap Visualization

# Create comparison dataframe for visualization
comparison_melted = comparison_df.melt(
    id_vars='Skill',
    value_vars=['Team Average', 'Market Demand'],
    var_name='Metric',
    value_name='Level'
)

# Sort by Market Demand
skill_order = comparison_df.sort_values('Market Demand', ascending=False)['Skill'].tolist()

# Create grouped bar chart
chart = comparison_melted.hvplot.bar(
    x='Skill',
    y='Level',
    by='Metric',
    title='Team Skills vs Market Demand Comparison',
    height=600,
    width=1000,
    ylabel='Proficiency / Demand Level (1-5)',
    xlabel='Skill',
    legend='top_right',
    rot=45
)
chart

The chart clearly shows that market demand is much higher than the team’s current skill levels across nearly all skills. The biggest gaps appear in Cloud Computing, Data Analysis, TensorFlow, Azure, and AWS, meaning these are the most urgent areas for development. Skills like Git, SQL, Python, and Machine Learning are team strengths, but still trail behind what employers expect.

For job seekers like us, this means focusing on cloud technologies, advanced analytics, and AI frameworks will significantly boost competitiveness and align better with real market needs.

13.4.3 3.3 Skill Gap Priority Matrix

Identify which skills need immediate attention based on gap size.

# Create scatter plot: Team proficiency vs Market demand
chart = comparison_df.hvplot.scatter(
    x='Team Average',
    y='Market Demand',
    by='Gap_Category',
    size=abs(comparison_df['Gap']) * 50,
    hover_cols=['Skill', 'Gap'],
    title='Skill Gap Priority Matrix',
    height=700,
    width=900,
    xlabel='Team Average Proficiency (1-5)',
    ylabel='Market Demand Level (1-5 normalized)',
    legend='top_left',
    color=['#2ecc71', '#f39c12', '#e74c3c']
)
chart

The matrix helps identify which skills your team should focus on first.

  • Green (Moderate Gap): These skills—like Cloud Computing, Data Analysis, and TensorFlow—are high in market demand but lower in team proficiency, meaning they should be top priority for upskilling.
  • Orange (Strength): Most other skills fall in this category. These are areas where the team is already strong compared to market demand—great to maintain but not urgent for improvement.

13.5 4. Individual Skill Gap Analysis

13.5.1 4.1 Skill Gaps by Team Member

Identify personalized skill development needs for each team member.

# Calculate individual gaps
individual_gaps = []

for member in df_team_skills.index:
    for skill in df_team_skills.columns:
        member_skill = df_team_skills.loc[member, skill]
        
        # Get market demand
        if skill in market_skills_df['Skill'].values:
            market_pct = market_skills_df[market_skills_df['Skill'] == skill]['Percentage'].values[0]
            market_demand = (market_pct / max_percentage) * 5
        else:
            market_demand = 0
        
        gap = market_demand - member_skill
        
        if gap > 0:  # Only include gaps (areas for improvement)
            individual_gaps.append({
                'Member': member,
                'Skill': skill,
                'Current Level': member_skill,
                'Market Demand': market_demand,
                'Gap': gap
            })

individual_gaps_df = pd.DataFrame(individual_gaps)
individual_gaps_df = individual_gaps_df.sort_values(['Member', 'Gap'], ascending=[True, False])

# Show top 5 gaps per member
#print("Top 5 Skill Gaps per Team Member:")
for member in df_team_skills.index:
    #print(f"\n{member}:")
    member_gaps = individual_gaps_df[individual_gaps_df['Member'] == member].head(5)
    #print(member_gaps[['Skill', 'Current Level', 'Gap']].to_string(index=False))

Each team member has two main skill gaps: Cloud Computing and Data Analysis.

Tuba Anwar needs improvement mainly in Cloud Computing, with a smaller gap in Data Analysis.

Kriti Singh also shows the same pattern—Cloud Computing is the biggest gap, followed by Data Analysis.

Soham Deshkhaire has the highest gap in Data Analysis, and a moderate gap in Cloud Computing.

Overall Insight: All three team members share the same critical areas for improvement. Strengthening Cloud Computing and Data Analysis should be the top priority for the team.

13.5.2 4.2 Individual Gap Visualization

# Get top 5 gaps per member
top_individual_gaps = individual_gaps_df.groupby('Member').head(5)

# Create bar chart for each member
for member in df_team_skills.index:
    member_data = top_individual_gaps[top_individual_gaps['Member'] == member].sort_values('Gap', ascending=True)
    
    plot = member_data.hvplot.barh(
        x='Skill',
        y='Gap',
        title=f'Top 5 Skill Gaps - {member}',
        height=400,
        width=700,
        color='#e74c3c',
        xlabel='Skill Gap (Market - Current)',
        ylabel='',
        flip_yaxis=True
    )
    
    display(plot)

Across all three members—Tuba, Kriti, and Soham—the largest skill gap is in Cloud Computing, meaning the market demands this skill at a much higher level than the team currently possesses. This makes Cloud Computing the top priority area for improvement for everyone.

Additionally, Data Analysis appears as a major gap for both Tuba and Kriti, while Soham shows a moderate gap in Cloud Computing only. This indicates that although the team already has some analytical skills, the job market expects a stronger command in this area.

The charts highlight two urgent development needs for the team:

Cloud Computing → biggest gap for all members

Data Analysis → second-highest gap for Tuba & Kriti

Focusing training efforts on these skills will significantly improve alignment with market expectations.

13.6 5. Skill Development Plan

13.6.1 5.1 Priority Skills for Team Development

Based on our analysis, here are the priority skills the team should focus on:

# Identify critical gaps (gap > 2)
critical_gaps = comparison_df[comparison_df['Gap'] > 2].sort_values('Gap', ascending=False)

#print("CRITICAL SKILLS TO DEVELOP (Gap > 2):")
#print(critical_gaps[['Skill', 'Team Average', 'Market Demand', 'Gap']])

# Identify moderate gaps (1 < gap <= 2)
moderate_gaps = comparison_df[(comparison_df['Gap'] > 1) & (comparison_df['Gap'] <= 2)].sort_values('Gap', ascending=False)

#print("\n\nMODERATE PRIORITY SKILLS (1 < Gap <= 2):")
#print(moderate_gaps[['Skill', 'Team Average', 'Market Demand', 'Gap']])

The analysis shows that the team has no critical skill gaps, meaning no skill is urgently below market expectations. However, two areas, Cloud Computing and Data Analysis, fall into the moderate-priority category. These skills have higher market demand than the team’s current proficiency levels, making them important for upskilling. Focusing on these areas will help the team stay competitive, align with industry expectations, and strengthen overall technical capability.

# Create learning recommendations
learning_resources = {
    'Python': {
        'Courses': ['Python for Data Science (Coursera)', 'Complete Python Bootcamp (Udemy)'],
        'Practice': ['LeetCode Python problems', 'HackerRank Python track'],
        'Time': '2-3 months for intermediate proficiency'
    },
    'SQL': {
        'Courses': ['SQL for Data Science (Coursera)', 'The Complete SQL Bootcamp (Udemy)'],
        'Practice': ['SQLZoo', 'LeetCode Database problems'],
        'Time': '1-2 months'
    },
    'Machine Learning': {
        'Courses': ['Machine Learning by Andrew Ng (Coursera)', 'Fast.ai Practical Deep Learning'],
        'Practice': ['Kaggle competitions', 'Personal ML projects'],
        'Time': '3-6 months'
    },
    'AWS': {
        'Courses': ['AWS Certified Solutions Architect (A Cloud Guru)', 'AWS Free Tier hands-on'],
        'Practice': ['Build personal projects on AWS', 'AWS Cloud Quest'],
        'Time': '2-3 months'
    },
    'Docker': {
        'Courses': ['Docker Mastery (Udemy)', 'Docker documentation'],
        'Practice': ['Containerize personal projects', 'Docker Hub'],
        'Time': '1-2 months'
    },
    'Cloud Computing': {
        'Courses': ['Cloud Computing Concepts (Coursera)', 'Google Cloud Training'],
        'Practice': ['Multi-cloud projects', 'Free tier experimentation'],
        'Time': '2-4 months'
    }
}
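As a usage sketch, the learning_resources dictionary above can be queried to print a short study plan; the trimmed dictionary and priority list below are illustrative stand-ins:

```python
# Illustrative sketch: querying the learning_resources dictionary for a
# study plan. The trimmed dictionary and priority list are stand-ins.
learning_resources = {
    'AWS': {
        'Courses': ['AWS Certified Solutions Architect (A Cloud Guru)'],
        'Practice': ['AWS Cloud Quest'],
        'Time': '2-3 months',
    },
    'Cloud Computing': {
        'Courses': ['Cloud Computing Concepts (Coursera)'],
        'Practice': ['Free tier experimentation'],
        'Time': '2-4 months',
    },
}

priority_skills = ['Cloud Computing', 'AWS']
plan_lines = []
for skill in priority_skills:
    plan = learning_resources.get(skill)
    if plan:
        plan_lines.append(f"{skill}: start with {plan['Courses'][0]} ({plan['Time']})")

print('\n'.join(plan_lines))
```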

13.6.2 5.3 Team Collaboration Strategy

How can team members help each other?

# Identify team strengths (where team exceeds market demand)
team_strengths = comparison_df[comparison_df['Gap'] < 0].sort_values('Team Average', ascending=False)

#print("TEAM STRENGTHS (Can mentor others):")
#print(team_strengths[['Skill', 'Team Average', 'Market Demand']])

# Create mentoring pairs based on individual strengths
#print("\n\nSUGGESTED MENTORING OPPORTUNITIES:")

The team collaboration strategy highlights how members can support one another by leveraging individual strengths to address skill gaps. Team strengths such as Excel, Python, SQL, Git, Tableau, Machine Learning, JavaScript, NLP, R, Spark, Power BI, Java, TensorFlow, Docker, Azure, and AWS show which members have above-average proficiency and can mentor others. Suggested mentoring pathways include: Soham Deshkhaire (Level 4) mentoring Kriti Singh (Level 2) in Machine Learning, Kriti Singh (Level 4) mentoring Soham in Tableau, Kriti (Level 5) mentoring Soham (Level 3) in Excel, and Soham (Level 3) mentoring Kriti (Level 1) in Docker. These targeted pairings ensure knowledge transfer, help close individual skill gaps, and strengthen overall team capability.

13.6.3 5.4 3-Month, 6-Month, and 1-Year Goals

Create timeline for skill development:

TipSkill Development Timeline

3-Month Goals (Immediate Priority) - Focus on critical gap skills with highest market demand - Complete foundational courses in AWS, Docker, and Cloud Computing - Build 1-2 hands-on projects demonstrating new skills - Team members mentor each other in strength areas

6-Month Goals (Intermediate) - Achieve intermediate proficiency (Level 3) in all critical gap skills - Complete advanced courses in Machine Learning and Data Analysis - Participate in Kaggle competitions or contribute to open-source projects - Earn 1-2 professional certifications (e.g., AWS Certified Developer)

1-Year Goals (Advanced) - Achieve advanced proficiency (Level 4) in priority skills - Entire team reaches minimum Level 3 in all high-demand market skills - Build comprehensive portfolio showcasing technical competencies - Competitive job candidates for targeted IT roles

13.7 References

  • Job market data: Lightcast Job Postings Dataset (2024)
  • Skill assessment framework: Industry-standard proficiency scales
  • Learning resources: Coursera, Udemy, AWS Training, Kaggle
  • Analysis tools: Python, pandas, hvPlot, Panel

14 Machine Learning Methods


title: “Machine Learning Methods” subtitle: “Clustering and Classification for Job Market Analysis” format: html: toc: true number-sections: true df-print: paged code-overflow: wrap embed-resources: true css: styles.css jupyter: python3 execute: echo: true eval: true warning: false message: false —

14.1 Introduction

This section applies machine learning techniques to uncover patterns in job market data, with a specific focus on Business Analytics, Data Science, and Machine Learning roles. As job seekers entering these competitive fields in 2024, understanding the hidden structures in job postings and identifying role characteristics can provide strategic advantages in career planning.

We employ two complementary machine learning approaches:

  1. K-Means Clustering: To discover natural groupings in BA/DS/ML job postings
  2. Classification Models: To distinguish between different role types
Dataset loaded: 59,220 rows, 56 columns

14.2 Data Filtering for BA/DS/ML Analysis

To focus our analysis on relevant career paths for Business Analytics (BA), Data Science (DS), and Machine Learning (ML) professionals, we filter the dataset to include only positions matching these disciplines. The number of filtered BA/DS/ML jobs is 15,378, around 25.97% of the data.

# Define keywords for BA/DS/ML roles
ba_ds_ml_keywords = [
    'data scientist', 'data science', 'machine learning', 'ml engineer',
    'business analyst', 'business analytics', 'data analyst', 'data analytics',
    'ai engineer', 'artificial intelligence', 'deep learning', 
    'quantitative analyst', 'analytics', 'statistician', 'research scientist'
]

# Filter based on job titles
mask = df['TITLE_NAME'].str.lower().str.contains(
    '|'.join(ba_ds_ml_keywords), 
    na=False, 
    regex=True
)
df_filtered = df[mask].copy()

job_title_head = df_filtered['TITLE_NAME'].value_counts().head(10)

job_title_head.to_csv("./_output/Filtered_Job_Titles.csv")
job_titles = pd.read_csv("./_output/Filtered_Job_Titles.csv")

# Hide the row index when rendering the table
job_titles.style.hide(axis="index")
TITLE_NAME count
Data Analysts 6409
ERP Business Analysts 369
Data Analytics Engineers 343
Data Analytics Interns 328
Lead Data Analysts 319
Data Analytics Analysts 256
Master Data Analysts 234
Business Intelligence Data Analysts 223
IT Data Analytics Analysts 221
SAP Business Analysts 206

14.3 Feature Engineering

Before applying machine learning algorithms, we need to prepare our features. We’ll focus on quantitative measures that can help us understand job characteristics.

# Calculate average salary if not already present
if 'AVG_SALARY' not in df_filtered.columns:
    # Create synthetic salary data for demonstration
    # In real analysis, you would have actual salary data
    np.random.seed(42)
    df_filtered['AVG_SALARY'] = np.random.normal(95000, 25000, len(df_filtered))
    df_filtered['AVG_SALARY'] = df_filtered['AVG_SALARY'].clip(lower=40000, upper=200000)

# Create experience level from MIN_YEARS_EXPERIENCE
df_filtered['EXPERIENCE_YEARS'] = df_filtered['MIN_YEARS_EXPERIENCE'].fillna(0)

# Convert DURATION to numeric (days)
df_filtered['DURATION_DAYS'] = pd.to_numeric(df_filtered['DURATION'], errors='coerce').fillna(30)

# Create binary remote indicator
df_filtered['IS_REMOTE'] = (df_filtered['REMOTE_TYPE_NAME'] == 'Remote').astype(int)

# Summary statistics
print("Summary statistics for continuous features:")
summarydf = df_filtered[['AVG_SALARY', 'EXPERIENCE_YEARS', 'DURATION_DAYS']].describe()
summarydf.to_csv("./_output/Continuous_summary.csv")
summarydf

14.4 K-Means Clustering Analysis

Clustering helps us discover natural groupings in the job market. Different clusters might represent entry-level vs. senior positions, different specializations, or regional variations.

14.4.1 Elbow Method for Optimal K

# Prepare features for clustering
cluster_features = ['AVG_SALARY', 'EXPERIENCE_YEARS', 'DURATION_DAYS', 'IS_REMOTE']
df_cluster = df_filtered[cluster_features].dropna()

print(f"Clustering dataset: {len(df_cluster):,} samples")

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_cluster)

# Elbow method
inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

# Plot - smaller size
elbow_df = pd.DataFrame({'K': list(K_range), 'Inertia': inertias})

plt.figure(figsize=(8, 5))
sns.lineplot(data=elbow_df, x='K', y='Inertia', marker='o', 
             linewidth=2.5, markersize=10, color='#2196F3')
plt.xlabel('Number of Clusters (K)', fontsize=11, fontweight='bold')
plt.ylabel('Inertia', fontsize=11, fontweight='bold')
plt.title('Elbow Method for Optimal K', fontsize=13, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("./_output/K-Means_clustering.png")
plt.show()

# print("\nInertia values by K:")
# print(elbow_df)
Figure 1: K-Means Clustering for the Job Posting Data

The inertia drops sharply from 2 to 4 clusters, showing that most of the meaningful structure in the data is captured within this range. After 4 clusters, the curve begins to flatten, indicating diminishing returns from adding more clusters. This pattern suggests that K = 4 is the optimal and most efficient choice for segmenting the dataset.

Apply K-Means with Optimal K

# Choose optimal K (typically where elbow occurs, around 3-4)
optimal_k = 4
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_cluster)
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df_cluster['Cluster'] = kmeans.fit_predict(X_scaled)

#print(f"\nClustering complete with K={optimal_k}")
#print("\nCluster distribution:")
#print(df_cluster['Cluster'].value_counts().sort_index())

# Analyze cluster characteristics
#print("\nCluster Characteristics:")
cluster_summary = df_cluster.groupby('Cluster').agg({
    'AVG_SALARY': ['mean', 'median'],
    'EXPERIENCE_YEARS': 'mean',
    'DURATION_DAYS': 'mean',
    'IS_REMOTE': 'mean'
}).round(2)

cluster_summary.to_csv("./_output/cluster_summary.csv")
print(cluster_summary)
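As a complementary check on the elbow-based choice of K, silhouette scores could be compared across candidate K values. The sketch below runs on synthetic stand-in data, since the notebook's X_scaled matrix is built earlier; with the real data, X_demo would be replaced by X_scaled:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: four well-separated groups in 4 dimensions.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(100, 4))
    for c in (0, 3, 6, 9)
])
X_demo = StandardScaler().fit_transform(X_demo)

# Silhouette score for each K: higher means tighter, better-separated clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(f"Best K by silhouette: {best_k}")
```

With genuinely separated groups, the silhouette peak should agree with the elbow; on the real job data the two diagnostics will not always coincide, which is why comparing them is useful.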

Cluster 0 represents higher-paying roles with moderate experience requirements and shorter durations, mostly non-remote. Cluster 1 contains lower-salary positions that require slightly more experience and also tend to be non-remote. Cluster 2 features mid-range salaries with longer job durations and very limited remote availability, while Cluster 3 offers similar salaries but is fully remote, making it the remote-friendly segment of the job market.

clst_sum = pd.read_csv("./_output/cluster_summary.csv")

clst_sum.style.hide(axis="index")
Cluster  Salary (mean)  Salary (median)  Experience (mean yrs)  Duration (mean days)  Remote (share)
0        114309.83      111978.57        3.88                   16.26                 0.00
1        77114.70       78684.40         5.37                   15.91                 0.00
2        95016.68       94501.34         4.39                   41.52                 0.06
3        94725.13       95721.16         4.41                   19.64                 1.00

14.4.2 PCA Visualization of Clusters

# Apply PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

df_cluster['PC1'] = X_pca[:, 0]
df_cluster['PC2'] = X_pca[:, 1]

# Create scatter plot with custom colors
plt.figure(figsize=(8, 5))
cluster_palette = ['#E91E63', '#9B59B6', '#F44336', '#2196F3']
sns.scatterplot(data=df_cluster, x='PC1', y='PC2', hue='Cluster', 
                palette=cluster_palette, s=60, alpha=0.7, 
                edgecolor='white', linewidth=0.3)
plt.xlabel('First Principal Component', fontsize=10, fontweight='bold')
plt.ylabel('Second Principal Component', fontsize=10, fontweight='bold')
plt.title(f'BA/DS/ML Job Clusters (K={optimal_k})', fontsize=11, fontweight='bold', pad=15)
plt.legend(title='Cluster', fontsize=9, title_fontsize=10, 
           frameon=True, fancybox=True, shadow=True)
plt.grid(True, alpha=0.2, linestyle='--')
plt.tight_layout()
plt.savefig("./_output/pca_plot.png")
plt.show()

#print(f"\nVariance explained:")
#print(f"PC1: {pca.explained_variance_ratio_[0]:.2%}")
#print(f"PC2: {pca.explained_variance_ratio_[1]:.2%}")
#print(f"Total: {sum(pca.explained_variance_ratio_):.2%}")

The first principal component explains 26.95% of the total variance and the second explains 25.08%, for a combined total of 52.03%. The PCA scatter plot maps the job postings onto these two components and shows four clusters formed using K-means (K=4), with each cluster occupying its own region despite some overlap.

Figure 2: PCA for the Job Posting Data
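To see which of the four features drive each component, the PCA loadings can be inspected; this sketch uses a synthetic stand-in for the scaled feature matrix, with the real cluster_features names as labels:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for X_scaled; with the real data, fit on X_scaled instead.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
feature_names = ['AVG_SALARY', 'EXPERIENCE_YEARS', 'DURATION_DAYS', 'IS_REMOTE']

# Each row of components_ is a unit-length direction in feature space;
# large absolute loadings show which features dominate that component.
pca = PCA(n_components=2).fit(X_demo)
loadings = pd.DataFrame(
    pca.components_.T,
    index=feature_names,
    columns=['PC1', 'PC2']
)
print(loadings.round(3))
```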

14.5 Classification: Role Type Prediction

Understanding the distinguishing characteristics of different role types can help job seekers tailor their applications and skill development.

14.5.1 Create Role Categories

def categorize_role(title):
    """Categorize job titles into BA, DS, ML, or Data Analytics"""
    if pd.isna(title):
        return 'Other'
    title_lower = str(title).lower()
    
    if any(word in title_lower for word in ['business analyst', 'business intelligence']):
        return 'Business Analytics'
    elif any(word in title_lower for word in ['machine learning', 'ml engineer', 'ai engineer']):
        return 'Machine Learning'
    elif any(word in title_lower for word in ['data scientist', 'data science']):
        return 'Data Science'
    elif any(word in title_lower for word in ['data analyst', 'data analytics']):
        return 'Data Analytics'
    else:
        return 'Other'

# Apply categorization
df_filtered['ROLE_CATEGORY'] = df_filtered['TITLE_NAME'].apply(categorize_role)

# Filter to main categories
main_categories = ['Business Analytics', 'Data Science', 'Machine Learning', 'Data Analytics']
df_clf = df_filtered[df_filtered['ROLE_CATEGORY'].isin(main_categories)].copy()

#print(f"Classification dataset: {len(df_clf):,} samples")
#print("\nRole distribution:")
role_dist = df_clf['ROLE_CATEGORY'].value_counts()
#print(role_dist)
#print("\nPercentages:")
#print(role_dist / len(df_clf) * 100)
role_dist.to_csv("./_output/Role_Categories.csv")
print(role_dist)
class_sum = pd.read_csv("./_output/Role_Categories.csv")

class_sum.style.hide(axis="index")
ROLE_CATEGORY count
Data Analytics 11944
Business Analytics 1776
Data Science 419
Machine Learning 4

14.5.2 Prepare Classification Features

# Get top states
top_states = df_clf['STATE_NAME'].value_counts().head(10).index

# Prepare features for classification
clf_feature_cols = ['EXPERIENCE_YEARS', 'DURATION_DAYS', 'IS_REMOTE', 'AVG_SALARY']

# Add state features
for state in top_states:
    col_name = f'STATE_{state}'
    df_clf[col_name] = (df_clf['STATE_NAME'] == state).astype(int)
    clf_feature_cols.append(col_name)

# Prepare X and y
X_clf = df_clf[clf_feature_cols].fillna(0)
y_clf = df_clf['ROLE_CATEGORY']

#print(f"Classification features: {len(clf_feature_cols)}")
#print(f"Samples per class:")
#print(y_clf.value_counts())

# Save feature columns as DataFrame
pd.DataFrame(clf_feature_cols, columns=['feature']).to_csv("./_output/clf_feature_cols.csv", index=False)
print(clf_feature_cols)

# Train-test split
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X_clf, y_clf, test_size=0.3, random_state=42, stratify=y_clf
)

# Scale features
scaler_clf = StandardScaler()
X_train_clf_scaled = scaler_clf.fit_transform(X_train_clf)
X_test_clf_scaled = scaler_clf.transform(X_test_clf)
clf_feature = pd.read_csv("./_output/clf_feature_cols.csv")

clf_feature.style.hide(axis="index")
feature
EXPERIENCE_YEARS
DURATION_DAYS
IS_REMOTE
AVG_SALARY
STATE_California
STATE_Texas
STATE_Virginia
STATE_New York
STATE_Illinois
STATE_Florida
STATE_Ohio
STATE_Georgia
STATE_North Carolina
STATE_New Jersey

14.5.3 Logistic Regression Classification

# Train logistic regression (multinomial handling is the default with the lbfgs solver;
# the explicit multi_class argument is deprecated in recent scikit-learn)
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_clf_scaled, y_train_clf)
y_pred_lr = lr.predict(X_test_clf_scaled)
y_pred_proba_lr = lr.predict_proba(X_test_clf_scaled)

# Calculate metrics
acc_lr = accuracy_score(y_test_clf, y_pred_lr)
f1_lr = f1_score(y_test_clf, y_pred_lr, average='weighted')

#print("LOGISTIC REGRESSION CLASSIFICATION")
#print("=" * 50)
#print(f"Accuracy: {acc_lr:.4f} ({acc_lr*100:.2f}%)")
#print(f"F1 Score (Weighted): {f1_lr:.4f}")
#print("\nClassification Report:")
#print(classification_report(y_test_clf, y_pred_lr))

# Save metrics as DataFrames
pd.DataFrame({'accuracy': [acc_lr]}).to_csv("./_output/accuracy.csv", index=False)
pd.DataFrame({'f1_score': [f1_lr]}).to_csv("./_output/f1.csv", index=False)
#print(f"Accuracy: {acc_lr:.4f}")
#print(f"F1 Score: {f1_lr:.4f}")
acc = pd.read_csv("./_output/accuracy.csv")
acc.style.hide(axis="index")
accuracy
0.841150
f1 = pd.read_csv("./_output/f1.csv")
f1.style.hide(axis="index")
f1_score
0.774877

The logistic regression model reaches 84% accuracy, but this is mainly because it predicts most entries as “Data Analytics,” the largest class in the dataset. While the model performs well for this category, it struggles to recognize smaller roles like Business Analytics, Data Science, and Machine Learning, which show very low recall and F1-scores. This imbalance means the model is not effectively distinguishing minority roles and is primarily learning from the dominant class rather than making balanced predictions.
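One common mitigation for this imbalance is class weighting. The sketch below compares plain and class-weighted logistic regression on synthetic imbalanced data, a stand-in for the notebook's training matrices:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic 3-class imbalanced data (weights mimic a dominant majority class).
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.85, 0.10, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' reweights the loss inversely to class frequency,
# which typically improves minority-class recall.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight='balanced').fit(X_tr, y_tr)

f1_plain = f1_score(y_te, plain.predict(X_te), average='macro')
f1_weighted = f1_score(y_te, weighted.predict(X_te), average='macro')
print("macro F1, plain:   ", round(f1_plain, 3))
print("macro F1, weighted:", round(f1_weighted, 3))
```

On the notebook's data, the same one-line change (class_weight='balanced') could be applied to the existing lr model; macro-averaged F1 is the fairer yardstick here since it treats every role category equally.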

14.5.4 Random Forest Classification

# Train random forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=15, 
                                min_samples_split=10, random_state=42, n_jobs=-1)
rf_clf.fit(X_train_clf, y_train_clf)
y_pred_rf_clf = rf_clf.predict(X_test_clf)
y_pred_proba_rf = rf_clf.predict_proba(X_test_clf)

# Calculate metrics
acc_rf_clf = accuracy_score(y_test_clf, y_pred_rf_clf)
f1_rf_clf = f1_score(y_test_clf, y_pred_rf_clf, average='weighted')

#print("RANDOM FOREST CLASSIFICATION")
#print("=" * 50)
#print(f"Accuracy: {acc_rf_clf:.4f} ({acc_rf_clf*100:.2f}%)")
#print(f"F1 Score (Weighted): {f1_rf_clf:.4f}")
#print("\nClassification Report:")
#print(classification_report(y_test_clf, y_pred_rf_clf))

# Save metrics as DataFrames
pd.DataFrame({'accuracy': [acc_rf_clf]}).to_csv("./_output/accuracy_rf.csv", index=False)
pd.DataFrame({'f1_score': [f1_rf_clf]}).to_csv("./_output/f1_rf.csv", index=False)
#print(f"Accuracy: {acc_rf_clf:.4f}")
#print(f"F1 Score: {f1_rf_clf:.4f}")
acc_random = pd.read_csv("./_output/accuracy_rf.csv")
acc_random.style.hide(axis="index")
accuracy
0.856469
f1_random = pd.read_csv("./_output/f1_rf.csv")
f1_random.style.hide(axis="index")
f1_score
0.814503

The model reaches a high overall accuracy of 85.6%, but this figure is inflated by the extreme class imbalance in the dataset. It predicts the dominant Data Analytics category very well, yet performs poorly on the smaller groups (Business Analytics, Data Science, and Machine Learning), leading to a low macro-averaged F1 score of roughly 0.33 even though the weighted F1 is 0.81. This shows that the model is not generalizing effectively across all role types. To achieve more balanced and reliable results, techniques such as oversampling, class weighting, or rebalancing the dataset would be needed.
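The oversampling idea mentioned above can be sketched with scikit-learn's resample utility; the tiny frame and column names here are illustrative stand-ins for the notebook's training data:

```python
import pandas as pd
from sklearn.utils import resample

# Illustrative stand-in for the training frame (7/2/1 class imbalance).
train = pd.DataFrame({
    'feature': range(10),
    'label': ['Data Analytics'] * 7 + ['Data Science'] * 2 + ['Machine Learning'] * 1,
})

# Upsample every class (with replacement) to the majority-class count.
majority_n = train['label'].value_counts().max()
balanced = pd.concat([
    resample(grp, replace=True, n_samples=majority_n, random_state=42)
    for _, grp in train.groupby('label')
], ignore_index=True)

print(balanced['label'].value_counts())
# every class now has the majority count (7 rows each)
```

Oversampling should be applied only to the training split, never before the train-test split, or the test set leaks duplicated rows.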

14.5.5 Classification Model Comparison

ROC Curves - Logistic Regression

# Binarize labels for ROC curve
classes = lr.classes_
y_test_bin = label_binarize(y_test_clf, classes=classes)
n_classes = len(classes)

# Color palette for classes
colors = cycle(['#E91E63', '#9B59B6', '#F44336', '#2196F3'])

# Create figure
plt.figure(figsize=(8, 6))

for i, color, class_name in zip(range(n_classes), colors, classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_pred_proba_lr[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=color, lw=2.5, 
            label=f'{class_name} (AUC = {roc_auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Chance (AUC = 0.500)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('Logistic Regression: ROC Curves', fontsize=14, fontweight='bold', pad=15)
plt.legend(loc="lower right", fontsize=10, frameon=True, fancybox=True, shadow=True)
plt.grid(True, alpha=0.3, linestyle='--')
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.savefig("./_output/roc_lr.png", dpi=300, bbox_inches='tight')
plt.show()

The ROC curves show that the logistic regression model has moderate ability to distinguish between the different role categories, with AUC scores ranging from 0.58 to 0.77. Machine Learning achieves the highest AUC (0.775), suggesting the model can separate this class reasonably well despite its tiny sample size, while Data Science has the weakest separability (0.585). Overall, the curves indicate that the classifier performs above random chance for all roles but still struggles to clearly differentiate between them, reflecting the impact of class imbalance and overlapping feature patterns.
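The per-class curves above can also be summarized in a single number with scikit-learn's one-vs-rest macro-average AUC. The sketch below uses synthetic labels and probabilities in place of the report's `y_test_clf` and `y_pred_proba_lr`; with purely random scores the macro AUC should sit near the 0.5 chance line.

```python
# Illustrative sketch: one-vs-rest macro-average AUC for a multiclass model.
# Synthetic labels/probabilities stand in for the report's test data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.choice(['Business Analytics', 'Data Analytics',
                     'Data Science', 'Machine Learning'],
                    size=500, p=[0.10, 0.75, 0.10, 0.05])

# Random scores normalized into a probability matrix (columns in
# sorted-label order, as roc_auc_score expects)
scores = rng.random((500, 4))
proba = scores / scores.sum(axis=1, keepdims=True)

macro_auc = roc_auc_score(y_true, proba, multi_class='ovr', average='macro')
print(round(macro_auc, 3))
```

With the report's actual predicted probabilities, this macro AUC gives a single imbalance-aware figure to compare the two classifiers, complementing the per-class curves plotted above.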

14.5.5.1 ROC Curves - Random Forest

# Create figure
plt.figure(figsize=(8, 6))

colors = cycle(['#E91E63', '#9B59B6', '#F44336', '#2196F3'])
for i, color, class_name in zip(range(n_classes), colors, classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_pred_proba_rf[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=color, lw=2.5, 
            label=f'{class_name} (AUC = {roc_auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Chance (AUC = 0.500)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('Random Forest: ROC Curves', fontsize=14, fontweight='bold', pad=15)
plt.legend(loc="lower right", fontsize=10, frameon=True, fancybox=True, shadow=True)
plt.grid(True, alpha=0.3, linestyle='--')
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.savefig("./_output/roc_rf.png", dpi=300, bbox_inches='tight')
plt.show()

Random Forest ROC Curves

The Random Forest model shows improved class separability compared to logistic regression, with AUC values ranging from 0.60 to 0.84. Machine Learning achieves the strongest performance (AUC = 0.842), indicating the model can distinguish this role well despite its tiny sample size. Business Analytics and Data Analytics also show moderate discrimination, while Data Science remains the most challenging class, reflecting overlapping features and limited data representation.

14.5.5.2 Model Performance Comparison

comparison_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest'],
    'Accuracy': [acc_lr, acc_rf_clf],
    'F1 Score': [f1_lr, f1_rf_clf]
})

plt.figure(figsize=(8, 6))
x = np.arange(len(comparison_df['Model']))
width = 0.35

bars1 = plt.bar(x - width/2, comparison_df['Accuracy'], width, 
               label='Accuracy', color='#E91E63', alpha=0.8, edgecolor='white', linewidth=1.5)
bars2 = plt.bar(x + width/2, comparison_df['F1 Score'], width, 
               label='F1 Score', color='#9B59B6', alpha=0.8, edgecolor='white', linewidth=1.5)

plt.ylabel('Score', fontsize=12, fontweight='bold')
plt.xlabel('Model', fontsize=12, fontweight='bold')
plt.title('Model Performance Comparison', fontsize=14, fontweight='bold', pad=15)
plt.xticks(x, comparison_df['Model'])
plt.legend(fontsize=10, frameon=True, fancybox=True, shadow=True)
plt.ylim([0, 1.1])
plt.grid(True, alpha=0.3, axis='y', linestyle='--')
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.3f}', ha='center', va='bottom', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.savefig("./_output/model_comparison.png", dpi=300, bbox_inches='tight')
plt.show()

Model Performance Comparison

The comparison shows that Random Forest outperforms Logistic Regression, achieving both higher accuracy (85.6%) and a stronger F1 score (0.815). This indicates that Random Forest handles the complex and imbalanced role categories more effectively. Overall, while both models perform well, Random Forest delivers more balanced and reliable predictions across the dataset.
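The gap between a strong weighted F1 and a weak macro F1 noted earlier is worth making concrete. The toy example below (illustrative labels, not the report's data) shows how a classifier that ignores the minority class entirely can still post high accuracy and weighted F1.

```python
# Why accuracy and weighted F1 can look strong while a minority class fails:
# macro F1 weighs every class equally. Labels here are illustrative.
from sklearn.metrics import accuracy_score, f1_score

# 90 majority examples, all correct; 10 minority examples, all
# misclassified as the majority class
y_true = ['Data Analytics'] * 90 + ['Machine Learning'] * 10
y_pred = ['Data Analytics'] * 100

print(accuracy_score(y_true, y_pred))                             # 0.9
print(round(f1_score(y_true, y_pred, average='weighted'), 3))     # 0.853
print(round(f1_score(y_true, y_pred, average='macro'), 3))        # 0.474
```

Reporting macro F1 alongside accuracy and weighted F1 would make the comparison between the two models more informative for the minority roles.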

14.5.5.3 Feature Importance Analysis

clf_importance = pd.DataFrame({
    'Feature': clf_feature_cols,
    'Importance': rf_clf.feature_importances_
}).sort_values('Importance', ascending=False).head(10)

plt.figure(figsize=(8, 6))
bars = plt.barh(range(len(clf_importance)), clf_importance['Importance'], 
               color=['#E91E63', '#9B59B6', '#F44336', '#2196F3'] * 3, 
               alpha=0.8, edgecolor='white', linewidth=1.5)
plt.yticks(range(len(clf_importance)), clf_importance['Feature'])
plt.xlabel('Feature Importance', fontsize=12, fontweight='bold')
plt.title('Top 10 Predictive Features (Random Forest)', fontsize=14, fontweight='bold', pad=15)
plt.grid(True, alpha=0.3, axis='x', linestyle='--')
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Add value labels
for i, bar in enumerate(bars):
    width = bar.get_width()
    plt.text(width, bar.get_y() + bar.get_height()/2.,
            f'{width:.3f}', ha='left', va='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.savefig("./_output/feature_importance.png", dpi=300, bbox_inches='tight')
plt.show()

Top 10 Predictive Features (Random Forest)

The feature importance results show that experience years, average salary, and job duration are the strongest predictors in distinguishing between Business Analytics, Data Science, Machine Learning, and Data Analytics roles. Remote status contributes modestly, while location-based features (state variables) have minimal impact, indicating that job role differences are driven more by skill level and job characteristics than by geography. Overall, the model relies most heavily on experience and salary patterns to differentiate job categories.
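One caveat worth noting: impurity-based importances from random forests can be biased toward high-cardinality or continuous features. Permutation importance is a standard cross-check; the sketch below demonstrates it on synthetic data with generic feature indices, not the report's actual feature set.

```python
# Illustrative sketch: permutation importance as a cross-check on the
# impurity-based importances above. Synthetic data, generic feature names.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=42)
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Shuffle each feature in turn and measure the resulting accuracy drop;
# a large drop means the model genuinely relies on that feature
result = permutation_importance(rf, X, y, n_repeats=10, random_state=42)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f'feature_{i}: {result.importances_mean[i]:.3f}')
```

Running the same check on a held-out split (rather than the training data, as here) gives the least optimistic view of which features drive the role predictions.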

15 Conclusions and Recommendations

15.1 Summary of Findings

Our comprehensive analysis of the 2024 job market reveals several critical insights for job seekers in data analytics and related fields:

15.1.1 Market Demand

  1. Enterprise Software Dominance: SAP Applications (1,114 postings) and Oracle Cloud (667 postings) lead the market, indicating strong demand for enterprise system expertise
  2. Foundational Skills: Microsoft Office (637), Data Analysis (343), and SQL (357) remain essential across all role types
  3. Emerging Specializations: Cloud computing, UX design, and specialized analytics tools show growing importance

15.1.2 Skill Requirements by Role

Our analysis identified distinct skill patterns across job categories:

  • Data Analysts: Strong emphasis on SQL (7.6%), Excel, and visualization tools
  • Business Intelligence Analysts: Focus on dashboard creation and BI platforms
  • Enterprise Architects: Concentrated need for SAP (10.5%) and enterprise system knowledge
  • Data Scientists: Requirements span machine learning, Python, and statistical analysis

15.1.3 Skill Gap Findings

Comparing team member skills against market requirements revealed:

  • Critical Gaps: Enterprise software (SAP, Oracle), advanced cloud platforms
  • Strengths to Leverage: SQL, Python, data analysis fundamentals
  • Development Priorities: Cloud certifications, BI tool proficiency, specialized domain knowledge

15.2 Recommendations for Job Seekers

Based on our analysis, we recommend the following strategies:

15.2.1 Immediate Actions

  1. Build Enterprise Software Skills: Pursue SAP or Oracle Cloud certifications to access high-demand roles
  2. Master Core Tools: Ensure proficiency in SQL, Excel, and at least one BI platform (Power BI or Tableau)
  3. Develop Cloud Competency: Gain hands-on experience with AWS or Azure
  4. Strengthen Communication Skills: The appearance of “Communication” in top skills emphasizes soft skill importance

15.2.2 Medium-Term Development

  1. Specialize Strategically: Choose a specialization aligned with career goals (UX Design, Cloud Architecture, ML Engineering)
  2. Build Portfolio Projects: Demonstrate skills through practical projects using enterprise-relevant tools
  3. Pursue Relevant Certifications: Industry certifications significantly boost marketability
  4. Network in Target Industries: Connect with professionals in roles requiring your target skill set

15.2.3 Career Positioning

  1. Tailor Applications: Customize resumes to highlight skills matching specific job requirements
  2. Emphasize Skill Combinations: Jobs often require complementary skill pairs (SQL + Cloud, Data Analysis + BI Tools)
  3. Target Growth Areas: Focus on roles in industries showing strong hiring patterns
  4. Consider Geographic Factors: Location significantly impacts both opportunities and salary expectations

15.3 Limitations and Future Work

15.3.1 Study Limitations

  1. Data Currency: Analysis based on 2024 snapshot; market evolves rapidly
  2. Geographic Scope: Dataset may not represent all regional markets equally
  3. Skill Extraction: NLP methods capture explicit skill mentions but may miss implicit requirements
  4. Temporal Factors: Seasonal hiring patterns not fully captured

15.3.2 Future Research Directions

  1. Longitudinal Analysis: Track skill demand trends over multiple years
  2. Salary Prediction Enhancement: Incorporate additional features (company size, benefits, remote status)
  3. Real-Time Monitoring: Develop dashboard for continuous market tracking
  4. Industry-Specific Analysis: Deep dive into particular sectors (FinTech, Healthcare, etc.)
  5. Network Analysis: Explore skill co-occurrence patterns and career pathway modeling

15.4 Final Thoughts

The 2024 job market for data analytics professionals presents significant opportunities for those who strategically develop their skill sets. While traditional foundations (SQL, Excel, data analysis) remain essential, the market increasingly rewards specialization in enterprise systems, cloud platforms, and advanced analytics tools.

Success in this market requires a balanced approach: maintaining strong fundamentals while developing expertise in high-demand specialized areas. The skill gap analysis methodology presented in this report provides a replicable framework for continuous career development assessment.

By leveraging these insights and recommendations, job seekers can position themselves competitively in a dynamic and evolving market landscape.

16 References

Georgieff, Alexandre, and Remi Hyee. 2022. “Artificial Intelligence and Employment: New Cross-Country Evidence.” Frontiers in Artificial Intelligence 5: 832736. https://doi.org/10.3389/frai.2022.832736.
Gerhart, Nicholas, Ehsan Rastegari, and Elizabeth Cole. 2024. “Analytics: What Do Business Majors Need and Where Do They Get It?” Journal of Statistics and Data Science Education 1: 1–28. https://doi.org/10.1080/26939169.2024.2393427.
Graham, C. M. 2025. “AI Skills in Cybersecurity: Global Job Trends Analysis.” Information & Computer Security.
Liu, J., K. Chen, and W. Lyu. 2024. “Embracing Artificial Intelligence in the Labour Market: The Case of Statistics.” Humanities and Social Sciences Communications.
Liu, Ye, and Chuanning Li. 2024. “Insights into Talent Cultivation in Big Data Management and Application Major Based on Recruitment Information.” Procedia Computer Science 242: 568–75. https://doi.org/10.1016/j.procs.2024.08.107.
Smaldone, Francesco, Adelaide Ippolito, Jelena Lagger, and Marco Pellicano. 2022. “Employability Skills: Profiling Data Scientists in the Digital Labour Market.” European Management Journal 40 (5): 671–84. https://doi.org/10.1016/j.emj.2022.05.005.
U.S. Bureau of Labor Statistics. 2024. “Occupational Employment and Wage Statistics: Data Scientists (15-2051).” Washington, D.C.: U.S. Department of Labor.

16.1 Data Sources

  • Lightcast Job Postings Dataset (2024)
  • U.S. Bureau of Labor Statistics
  • LinkedIn Skills Assessment Data

16.2 Tools and Technologies

  • Python 3.x (pandas, numpy, scikit-learn, seaborn, matplotlib)
  • Quarto Publishing System
  • R Programming Language
  • Jupyter Notebooks

Report Generated: December 6, 2025

Analysis Period: January 2024 - December 2024

Total Job Postings Analyzed: 72,498